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Zusammenfassung 

Bci dcr Anwcndung von Reinforcement Learning Algorithmen auf die ph- 
ysische Welt muss man einen Weg findcn, aus komplexen Scnsordatcn Umgc- 
bungszustandc hcrauszufiltcrn. Obwohl die meisten bisherigen Ansatze hierfiir 
Heuristiken verwendcn, legt die Biologie nahe dass es eine Mcthodc geben muss, 
die solche Filter selbststandig konstruicren kann. 

Nebcn der Extraktion von Umgcbungszustandcn miissen dicse auch in einer 
fur modernc Reinforcement Algorithmen brauchbaren Form prasentiert werden. 
Viele dieser Algorithmen arbeiten mit linearen Funktionen, so dass die Filter 
gute lineare Approximationseigenschaften aufweisen solltcn. 

Diese Diplomarbeit mochte eine unuberwachte Lernmethode namens Slow 
Feature Analysis (SFA) vorschlagen, um diese Filter zu generieren. Trainiert 
mit einer Zufallsfolge von Sensordaten kann SFA eine Serie von Filtern erlernen. 
Theoretische Betrachtungen zeigen dass diese mit steigender Modellklasse und 
Trainingbeispielen zu trigonometrischen Funktionen konvergieren, welche fur 
ihre guten linearen Approximationseigenschaften bekannt sind. 

Wir haben diese Theorie mit Hilfe eines Roboters evaluiert. Als Aufabe 
soil der Roboter das Navigieren in einer einfachen Umgebung erlernen, wobei 
dcr Least Squares Policy Iteration (LSPI) Algorithmus zum Einsatz kommt. 
Als einzigen Sensor verfiigt der Roboter iiber eine auf seinem Kopf monticrtc 
Kamera, deren Bilder sich aber aufgrund ihrer Komplexitat nicht direkt als 
Umgebungszustande eignen. Wir konnten zeigen dass LSPI dank der durch 
SFA generierten Filter eine Erfolgsrate von ca. 80% erreichen kann. 
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Abstract 

The application of reinforcement learning algorithms onto real life problems 
always bears the challenge of filtering the environmental state out of raw sensor 
readings. While most approaches use heuristics, biology suggests that there 
must exist an unsupervised method to construct such filters automatically. 

Besides the extraction of environmental states, the filters have to represent 
them in a fashion that support modern reinforcement algorithms. Many popular 
algorithms use a linear architecture, so one should aim at filters that have good 
approximation properties in combination with linear functions. 

This thesis wants to propose the unsupervised method slow feature analysis 
(SFA) for this task. Presented with a random sequence of sensor readings, SFA 
learns a set of filters. With growing model complexity and training examples, the 
filters converge against trigonometric polynomial functions. These are known 
to possess excellent approximation capabilities and should therfore support the 
reinforcement algorithms well. 

We evaluate this claim on a robot. The task is to learn a navigational 
control in a simple environment using the least square policy iteration (LSPI) 
algorithm. The only accessible sensor is a head mounted video camera, but 
without meaningful filtering, video images are not suited as LSPI input. We 
will show that filters learned by SFA, based on a random walk video of the 
robot, allow the learned control to navigate successfully in ca. 80% of the test 
trials. 
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Notational conventions 

Algebra notation 

x Variable x 

x Column vector x 

[x\ , . . . , x n ] An n dimensional row vector with entries Xi 

A vector with all entries being 

1 A vector with all entries being 1 
M Matrix M 

I The identity matrix 

M T The transpose of matrix M e JR nxm : Vij : = M ri 

n 

tr(M) The trace of matrix M € TR nxn : tr(M) = £ M u 

i=l 

I I a; 1 1 Arbitrary norm of vector x 

\\x\\2 L2-norm of vector x G H n : ||aj||2 = \ 53 x i 

y »=x 

IMIlp Frobenius norm of matrix M G H" 



>nxm. 



|M|| P - JE E Mf 3 = Vtr(MTM) = v/tr(MMT) 
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Analysis notation 



A 


Set A 





The empty set 


[a,b] 


The real interval between a and b: [a, b] = {x\a < x < b} C IR 


Ax-B 


The set of all tuples (a, b) with a G A and 6 G 23 


\A\ 


The cardinality of set .A 


3* (A) 


The power set of set .A 




Function / : A -> S 


/(*) 


Multivariate function / : .A -> 23" 


dom /(x) 


The domain A of function / : A — > 23 


v/(x)| xo 


The gradient of function / : .A — > 23 with respect to 




its argument x G A at position xq: V/(x)| 2:o = \ Xo 



Statistics notation 

E[x] The expectation of the random variable x with 

distribution function p(x): E[x] = Jxp(x)dx 
E t [f(t)] The expectation over all n realizations of variable i, 

normally the empirical mean over index t: E t [/(t)] = i X)t=i /W 
C[x, y] The covariance between random variables x and y with 

joint distribution function p(x, y): 

C[x,y] =cov(x,y) = J J(x - E[x]) (y - E[y]) p(x, y) dxdy 
C[x] The covariance matrix of the random vector x, 

i.e. the covariance between all entries: (C[cc]).y = C[x,,Xj] 
V[x] The variance of the random variable x: V[x] = C[x, x] 

x ~ p(y, •) The random variable x G X is drawn according to the distribution 

function p : y x X — > [0, 1], where Vy G y : iK^/i 2; ) = 1 



VII 



List of figures 



Figure 11.11 Proposed methodology. Page [3] 

Figure I3~T1 Example value function. Page [2"31 

Figure 13.21 Example Q-value function. Page l24l 

Figure PI Results from Bcrkes and Wiskott (38]. Page [39] 

Figure PI Handwritten digits from the MNIST database. Page [39] 

Figure 14.31 Experimental setup of Franzius et al. [42] Page [40] 

Figure 14.41 SFA responses of Franzius et al. [32] . Page 2JJ 

Figure 14.51 Results from Wyss et al. [41] . Page |4T1 

Figure l4~6l Artificial and measured place cells. Page [42] 

Figure l5~Tl Experiment overview. Page [48] 

Figure EU The Pioneer 3DX robot. Page [49] 

Figure 1ST31 Illumination dependency of video images. Page [SU] 

Figure [S~4l Wall textures in the experiment. Page [50] 

Figure [S~5l Simulated experiment overview. Page lBTl 

Figure lS~6l Alternative environment overview. Page [52] 

Figure IBTl Prepared video examples. Pagel53l 

Figure 15.81 Image preparation overview. Page 1541 

Figure IS~9"1 Slowness vs. model complexity. PagelBTl 

Figure [5. 101 Slowness for kernel SFA. PagelBTl 

Figure IB". Ill SFA responses on test sets. Page [58] 

Figure IB". 121 Unmixed first SFA response. Page [58] 

Figure 15.131 Mixed second and third SFA response. Page [59] 

Figure IB". 141 Alternative environment SFA responses. Page 15(71 

Figure [5. 151 Policy iteration training set and reward. PagelBTl 

Figure IB. 161 Alternative environment training set and reward. Page [62] 
Figure IB. 171 Convergence quality for artificial state representation. Page [65] 

Figure IB. 181 Convergence quality for SFA state representation. Page [66] 

Figure [5. 191 Test trajectories of the robot. Page [55] 
Figure IB. 201 Convergence quality of the alternative environment. Page lBTl 



VIII 



List of algorithms 



Algorithm [TJ Linear least squares regression Page [TO] 

Algorithm [21 Linear ridge regression Page [IT] 

Algorithm [3] Kernelized ridge regression Page [14] 

Algorithm[U Greedy support vector selection algorithm Page [18] 

Algorithm EJ LSTD Page [28] 

Algorithm H LSQ Page [29] 

Algorithm |3 LSPI PagcEJJ 

Algorithm [8j Control Page [32] 

AlgorithmEJ Linear SFA Page [43] 
Algorithm [TO] Kernel SFA with kernel matrix approximation Page [45] 

Algorithm [TTJ Reference policy Page [64] 



Contents 



1 Introduction 

1 . 1 Motivation 

1.2 Method 

1.3 Experiments 

2 Preliminaries 

2.1 Regression 

2.1.1 Optimization problem 

2.1.2 Convexity 

2.1.3 Function classes 

2.1.4 Validation 

2.1.5 Linear regression algorithms 

2.2 Kernel techniques 

2.2.1 The kernel trick 

2.2.2 Kernelized regression 

2.2.3 Kernels 

2.2.4 Kernelized covariance eigenvalue problem 

2.2.5 Kernel matrix approximation 

2.2.6 Support vector selection 

3 Reinforcement Learning 

3.1 Introduction 

3.1.1 Markov processes 

3.1.2 State representation 

3.1.3 Trigonometric polynomials 

3.2 Value Estimation 

3.2.1 LSTD 

3.2.2 LSQ 

3.3 Policy Iteration 

3.3.1 Sampling 

3.4 Conclusion 



IX 



X 



CONTENTS 



4 Slow Feature Analysis 

4.1 Introduction 

4.1.1 Optimization problem 

4.1.2 Optimal responses . . 

4.2 Applications 

4.2.1 Receptive fields .... 

4.2.2 Pattern recognition . . 

4.2.3 Place cells 

4.3 Algorithms 

4.3.1 Linear SFA 

4.3.2 Expanded SFA .... 

4.3.3 Kernel SFA 

4.4 Conclusion 

5 Empirical validation 

5.1 Experiment description . . . 

5.1.1 Overview 

5.1.2 Formal description . . 

5.1.3 Robot 

5.1.4 Environment 

5.1.5 Simulator 

5.1.6 Video generation . . . 

5.1.7 Alternative approaches 

5.2 Learn preprocessing 

5.2.1 Slowness 

5.2.2 Responses 

5.2.3 Other environments . 

5.2.4 Discussion 

5.3 Learn navigation 

5.3.1 Policy evaluation . . . 

5.3.2 Artificial input .... 

5.3.3 Video input 

5.3.4 Other environments . 

5.3.5 Discussion 

5.4 Conclusion 

6 Conclusion 

6.1 Summary 

6.2 Discussion 

6.3 Outlook 



Chapter 1 

Introduction 



Modern robotics faces a major drawback. 

Over the last decades, hard- and software has matured into commercially 
applicable products. Nowadays, robotic control is able to cross deserts, navigate 
independently on other planets and even climb stairs on two legs. That is, if 
the robot is fine tuned to the course, operators stand by to correct inevitable 
jams and stairs have the right height and form. Modern robotic control is not 
very adaptive. 

The scientific discipline of adaptive control, on the other hand, is developed 
and tested mainly on discrete, small sized toy examples. Complex sensory input, 
uncertain and noise afflicted, is not compatible with these standard methods. 
A natural way to merge these two technologies is to find a preprocessing - a 
process that strips the essential information from raw sensor data. 

But what is essential information? Surely we can only answer this question 
in context of the individual control domain at hand. But whatever choice we 
make, we can be certain that we will forget something. An ideal preprocessing 
would also adapt to the control domain on its own. 

In this thesis we want to investigate the properties of an unsupervised tech- 
nique, called slow feature analysis, as an auto-adaptive preprocessing in the 
domain of environment specific navigational control. 

1.1 Motivation 

Roughly 35 years ago, biologists discovered a cluster of cells in the hippocampal 
area of rodents, which encode the spatial position of the animal. Within minutes 
of seemingly random movement in an unknown environment, the cells specialize 
to fire only around one position each. Moreover, the cell population covers the 
whole accessible area, which lead many scientists to believe that the so called 
place cells are a preliminary stage of the rodents navigational control [44] . 

How these cells adapt, on the other hand, is still a question open to debate. 
Until recently, the only explanation was the so called path integration, a tech- 
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nique in which the rodent integrates its movement up to estimate the current 
position. Since the sense of movement can be flawed, small mistakes will sum 
up over time and have to be corrected by external stimuli, which have to be 
identified in the environment, first. The computational approach to this prob- 
lem, used in robotics, is called simultaneous localization and mapping (SLAM) 
and will be discussed as an alternative to our approach in section 15. 1.71 

Last year, Franzius et al. [42] were able to show that a memoryless feed 
forward network is able to learn place cell behaviour. The network adapts to 
videos of a head mounted camera (substitutional for the rodents eyes) by an 
unsupervised learning technique called slow feature analysis (SFA). 

Reinforcement learning (or neuro- dynamic programming) is a method to 
learn a control based on reward and punishment. A set of rewarded/punished 
example movements is generalized to estimate the expected sum of future re- 
wards (value) at every position and for every possible action. Given a current 
position, the so called state of the system, the control chooses the action that 
promises the highest value. Obviously, the efficiency of this approach depends 
on how well the value can be estimated, which in return depends on the coding 
of the state. 

Place cells provide an intuitive coding for linear architectures. Weighting 
every cells output with the mean value of its active region can summed up 
approximate any value function up to a quality depending on the number of 
cells. Therefore, the place cells of Franzius et al. should be a natural (and 
biological plausible) preprocessing for linear value estimators. 

As it turns out, properly trained SFA produces a mapping of video images 
into corresponding trigonometric polynomials in the space of robot positions. 
Franzius' place cells were products of an additional step independent component 
analysis (ICA) which does not influence the linear approximation quality and 
can therefore be omitted. 

The goal of this thesis is to formulate this basic idea into a working procedure 
and to demonstrate its soundness in a real world robot navigation experiment. 

1.2 Method 

We want to learn a preprocessing using slow feature analysis (SFA) out of the 
video images of a robots head mounted camera. This preprocessing should 
represent the robots position and orientation (its state) in a fashion suitable 
for linear function approximation. With this state at hand, we want to use the 
reinforcement learning method policy iteration (PI) to learn a control for the 
robot. 

Note that the preprocessing is adapted to one environment, e.g. one room 
only, and will not work anywhere else. However, the same is true for any control 
learned by reinforcement learning, so SFA fits well within this framework. 

Both SFA and PI need an initial random walk, crossing the whole environ- 
ment. Therefore the robot has to drive around by choosing random actions, 
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first. The only needed sensor information of this phase is the video of a head 
mounted camera. However, for the reinforcement method, we also need to record 
the reward and punishment at every step. As simple task, the robot receives 
reward for entering a goal area and punishment for getting too close to walls. 
The resulting control should drive into the goal area as quickly as possible, while 
keeping its distance to walls. 

Secondly, SFA will learn a series of filters based on the recorded video. The 
output of these filters, applied on the initial video, is used as input of policy 
iteration. PI estimates the expected sum of future rewards (value) for every 
action and position. 

Control With the value estimator at hand, or to be precise its parameter 
vectors w^ 1 ', . . . ,u/ a ) (one for every action), the control works as depicted in 
figure 11.11 

(a) The robot starts in an unknown position and wishes to navigate into a 
goal area, marked with a G for demonstration purposes. 

(b) A head mounted video camera shoots a picture I{x, y, 9) out of the current 
perspective. 

(c) The series of filters . . . <f) n , learned by slow feature analysis, produce one 
real valued output </>i(x, y, each. For every possible action, the output 
is multiplied by a weight vector w, found by reinforcement learning. The 
sum of the weighted outputs is the value estimation V a (x,y,9) of this 
action. 

(d) The robot executes the action with the highest value and repeats the 
procedure until it reaches the goal. 



4 



CHAPTER 1. INTRODUCTION 



Multiple tasks The reinforcement method employed in this thesis, least squares 
policy iteration (LSPI), chooses the most promising in a finite number of ac- 
tions a, based on the current state x <G H d . Unfortunately, its complexity is 
0(d 3 a 3 ) in time and 0(d 2 a 2 ) in space. An efficient kernel SFA algorithm itself 
has a complexity of at least 0(m 3 ) in time and 0(m 2 ) in space, where m is 
the number of support vectors (comparable to d). So why not use a kernel ver- 
sion of LSPI when there is no significant advantage to the proposed method in 
computational complexity? 

For once, LSPIs complexity also depends on the number of actions a. An 
SFA preprocessing can use the available memory up to 0(m 2 ) to produce a 
number of filters d << m. This way one can consider a much larger number 
of actions. More importantly, the state space extracted by SFA can be used 
in more than one reinforcement problem. For example, when the robot should 
learn two tasks, one can learn both given the same initial video but different 
rewards. Therefore, determining the preprocessing once allows to learn multiple 
tasks quickly afterwards. 

Theory In chapter [3] the theoretical background of reinforcement learning is 
presented as well as the complete derivation of LSPI with all necessary algo- 
rithms. 

Chapter |4] describes slow feature analysis theoretically. It also contains an 
overview of recent applications and algorithms, including a novel derivation of 
a kcrnclizcd algorithm. 

Both chapters are based on common concepts of machine learning and kernel 
techniques, which are introduced for the sake of completeness in chapter [2j 

1.3 Experiments 

Within the NeuRoBot project, the author was able to perform experiments 
with a real robot. To check the theoretical predictions, we constructed a rect- 
angular area with tilted tables, in which the robot should navigate towards 
some virtual goal area. We tested two goal areas, which were not marked or 
discriminable otherwise besides the reward given in training. 

To evaluate the proposed method less time consuming, the author also im- 
plemented a simulated version of the above experiment. At last, theoretical 
predictions exist only for rectangular environments. To test the behaviour out- 
side this limitation, we used the simulator to create an environment consisting 
of two connected rooms. 

A detailed description of these experiments, as well as a thorough analysis 
of both SFA and LSPI under optimal conditions, can be found in chapter [5l 
The authors conclusions and an outlook of possible future works are presented 
in chapter [5] 



Chapter 2 

Preliminaries 



In this chapter we introduce the reader to the general concepts of machine 
learning, especially to linear models (at the example of least squares regression) 
in section [2~T1 and kernel techniques (with some less common procedures we will 
need in this thesis) in section |2~2"1 

We will start with a general introduction of regression (sec. 12.11 and l2.1.ip . 
followed by a discussion of convexity (sec. 12.1. 2[) . model choices (sec. I2.1.3[) 
and validation techniques (sec. I2.1.4[) . On this basis, we will finally define two 
commonly used linear regression algorithms: least squares regression and ridge 
regression (sec. I2.1."5)) . 

In the second part, we will give an overview on kernel techniques (sec. l2.2l and 
I2.2.ip . demonstrate them on a kernel regression algorithm fsec. 12. 2/2")) . discuss a 
kernclizcd version of the covariance eigenvalue problem (sec. !2.2!4|) and end with 
kernel matrix approximation methods (sec. I2.2.5[) and a corresponding support 
vector selection algorithm (sec. 12.2 ."G]) . 

2.1 Regression 

We live in a world where things follow causal relationships, which should be 
expressible as functions of observations (i.e. products of the real "causes"). 
The exact nature and design of these relationships are unknown but might 
be inferred from past observations. To complicate things even further, these 
observations can also be afflicted by noise, i.e. random distortions independent 
of the real "cause". Regression deals with inference of functional relationships 
from past observations, based on some simplifying assumptions. 

Assumption 2.1 The random variable t 6 T depends on the observation x G X 
by t = f(x) + 7] with r\ € T being another zero mean random variable. 

We assume that some kind of observation yields an input sample x, which can 
be assigned a target value by t by an expert or process of nature. Due to errors 
in this process, t is distorted by a random variable r\, independent of x. 
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Estimating the volume inside a balloon based on its diameter would be 
an example. Clearly there is a relationship, but since real balloons do not 
follow ideal mathematical shapes, it is hard to find. Physicists would take 
measurements of balloon diameters (the input samples) and air volumes (the 
target values) to find a function which explains most measurements with a 
minimum amount of noise. Since measurements are prone to be inexact and 
the rubber of individual balloons differs slightly, we have to expect some noise 
in the target values. An exact reproduction of the observed target values will 
therefore reproduce measure errors and mispredict unseen future inputs. The 
physicists approach to this problem is formulated as Ockhams razor, which in 
short states that theories have to be simple as well as explanatory. 

In the following, we will denote S = {(cc (1) , t (1) ), {x {n \ * (,l) )} £ #(XxT) 
as the training set. Regression aims to estimate the function f(x) based only 
on this data. The balance of Ockhams razor is difficult to maintain and led to 
a variety of regression algorithms. 

2.1.1 Optimization problem 

We wish to find a function y : X — > 7 which estimates the real (but unknown) re- 
lationship /. The only knowledge available is the training set §, so we construct 
a cost function C which defines the optimization goal indirectly by comparing 
the target values with the predictions made by some given function y. 

Definition 2.2 (Cost function) 

A cost function has the form C : (X — > T) x ,^(X x T) — > IR. 

The intuition is that the cost function is small if / is applied, and hopefully 
also small for functions similar to /. Minimizing C with respect to y should 
therefore lead to a similar function. 

The approach suggesting itself is to evaluate y at every sample x^' and 
consider the distance under some norm 1 1 • 1 1 to target t^' as the empirical cost 
for this sample. Because we have no reason to favor any sample, we sum up the 
individual costs and the result is called the empirical cost function [9] : 

n 

C em r(y,S)= y £\\y(x^)-t^\\ (2.1) 

3=1 

Note that, due to the norm, the noise term 77 in assumption 12.11 will not cancel 
out. The only way for C to reach zero is to reproduce the noise of t in y, which 
is not desirable. 

If we wish to avoid this problem, we can invoke Ockhams razor and penalize 
complex functions with a regularization term R(y) £ IR. A less complex function 
will be less likely to reproduce the observed noise and therefore predict unseen 
data better. Examples for R would be the length of the parameter vector (linear 
functions), the number of support vectors (kernel machines) or the number of 
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hidden neurons (neuronal networks) . Together with the empirical cost we obtain 
the regularized cost function [9] : 

n 

c r ^( y ,B) = \\y(x (j) ) - + R(y) (2-2) 

3=1 

Note that the regularization term R(y) is independent of the sample size n and 
with growing n will loose relevance. Section 12.1.41 will discuss the effects of 
sample size in more detail. 

Definition 2.3 (Optimal function) The global minimum of a cost function 
C(y,§) with respect to y is called optimal function y* of C: 

y* = argminC(y,S) (2.3) 

y 

The optimal function y* of C is the function we were looking for and should 
resemble / at least on the training set. Arbitrary functions can not be repre- 
sented in a computer, so a common approach chooses a parameterized function 
class 5" C (X — > T) to pick y = ^(w) from. 

One approach to find the optimal function is to calculate the gradient of C 
with respect to the function class parameters w at an arbitrary start position 
Wq. One can follow the negative gradient — VC| lUo in small steps, leading to 
smaller costs if the step size is well chosen. This method is called gradient 
descent pQ. 

The disadvantage is that the method will stop at every w that fulfills VC\ W = 
0, called an extreme^ of C. Depending on the considered function class 5F the 
cost function can have multiple extrema with respect to the parameters w, and 
not all have to be global minima after definition 12.31 In other words, gradient 
descent will converge to whatever extrema the gradient lead to, starting at Wq. 
This means we can never be sure that we reached a global minimum, if no closed 
solution of VC is possible. 

Keeping this in mind, it is wise to choose J such that every extrema of the 
cost function is a global minimum, ideally the only one. In this thesis we will 
focus on convex functions. 

2.1.2 Convexity 

There are other functions without local minima, but convex functions provide 
other desirable mathematic properties (see [7] for details) . In some special cases 
it is even possible to derive a closed analytical solution (section 12 . 1 .5(1 . 

Definition 2.4 (Convex function [7]) A function f : IR" — > HI is convex if 
dom/ is a convex set and if for all x, y £ dom/, and 9 with < 6 < 1, we 
have 

f(9x + (1 - 6)y) < 6f(x) + (1 - 6)f(y). (2.4) 

1 If also VVC|,i> > we speak of a minimum, but not necessarily a global minimum. 
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We can tighten the definition further. When the strict inequality in equation 
12.41 holds, we speak of a strictly convex function. These have the additional 
advantage that the minimum is unique, i.e. no other extrema exists. 

Anyway, the main property we are interested in both cases is that every 
extrema is a global minimum: 

Proposition 2.1 (First order convexity condition [7]) Suppose f is dif- 
ferentiable (i.e. its gradient V/ exists at each point in dom/, which is open). 
Then f is convex if and only if dom / is convex and 

f(y)>f(x) + Vf(x) T (y-x) (2.5) 

holds for all x, y € dom/. 
Proof: See [7j. 

Every extrema x of f satisfies Vf(x) = 0. According to proposition 12.11 
if / is convex then Vy € dom/ : f(y) > f(x), which identifies x as a global 
minimum of /. 

However, we do not necessarily wish the functions y € 5" but the discussed 
cost functions to be convex: 

Proposition 2.2 (Convex cost function) Let § € x T) be arbitrary 

but given. If R(y) and all y 6 are convex functions then the cost functions 
( eq. \ 2.1\) and ( eq. \2.2\) are convex in the domain J ' . 

Proof: y(x^) and — t^' are convex, and how lemma l2~3l shows, the sum of 
them is convex too. Lemma 12.41 proves that norms are convex, so their sum has 
to be, too. At last, C emp and R{y) are convex, and so is their sum C reg . □ 

Lemma 2.3 (Pointwise sum [7]) The pointwise sum of two convex functions 
fi and fa, / = /i + f% with dom / = dom /i n dom/2, is a convex function. 
Proof: Insert (Ox + (1 — 6)y) into /i(-) + /2O and apply (eq. 12.41) two times. 
□ 

Lemma 2.4 (Norms |7]) Every norm on IR™ is convex. 
Proof: If / : M" — s- IR, is a norm, and < 6 < 1, then 

f(9x + (1 - 9)y) < f(8x) + /((l - 0)y) = 6 f{x) + (1 - 8)f{y). 

The inequality follows from the triangle inequality, and the equality follows from 
homogeneity of a norm. □ 

2.1.3 Function classes 

Convexity of the cost function depends on its formulation and the considered 
function class ?C The family of convex functions is large (see [7] for 

examples). However, we want to introduce two classes that will play a mayor 
role in this thesis because they allow a closed analytical solution under the 
squared Li norm. 
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Definition 2.5 (Linear function [7J) A function f : TR P — > JR q is linear if 
for all x,y € TR P and a, p & TR.it satisfies the condition 

f(ax + /3y) = af(x)+j3f(y). (2.6) 

Linear functions have nice analytical properties, e.g. they can be uniquely 
determined by a matrix W £ IR pX9 : f(x) = W T :r. Convexity be seen by 
comparison of (cq. 12. 6[) and (eq. 12. 4p . 

Definition 2.6 (Affine function [7J) An affine function f : 1R P —> lR q is the 

sum of a linear function and a constant: f(x) = W T cc + b. 

The bias b violates the linear condition, but affine functions share many prop- 
erties with linear functions. They can express more relationships, however. For 
example, a line that does not cross the origin can be expressed as an affine, but 
not as a linear function. 

By extending the input vector a; by a constant (i.e. xq = 1) one can express 
an affine function as a linear function. This trick can be applied if the algorithm 
at hand requires linearity but the data can only be properly predicted by an 
affine function. 

Lemma 2.5 Affine functions are convex. 

Proof: We show convexity for every component fi(x) = wjx + b{. 
f i {6x + {l-6)y) = wj (fix + (1 - 8)y) + h 

= 6wJx + (l-6)wJy+(l-6 + 6)bi 
= 6Mx) + (l-8)Mv) 

□ 

2.1.4 Validation 

In regression, we do not know the real target function / of assumption 12. 1 [ but 
aim to find a "similar" function y £ 3\ Since we can not construct a similarity 
measure between / and y directly, we define a cost function instead. Though 
one can obtain an y* optimal to the cost function, the approach is susceptible 
to many sources of error: 

• The cost function can be chosen differently, leading to different optimal 
functions. Here the choice of regularized vs. empirical arises as well as the 
choice of the regularization term. Anyway, the two presented equations 
are not the only imaginable cost functions (see [T] for more). 

• The considered function class J might not be suited for the data at hand. 
On the one hand, 5" can be too weak, e.g. a parabola can not be represented 
by a linear function. If, on the other hand, the sample size n is too small, 
y* can reduce the individual costs of all training samples to zero, but take 
unreasonable values between them. This effect is called over-fitting [S]. 
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• The training sets sampling might introduce errors. If a region of X is left 
out, / can not be estimated there. More general, unbalanced sampling 
leads to a good estimation of / where many samples are available, but 
a poor estimation where this is not the case. The sum of the individual 
costs tolerates big errors in rarely sampled regions, if it means to shrink the 
error in highly sampled regions even slightly. This effect is also influenced 
by the choice of 9\ Especially linear and affine functions are known to 
react badly to unbalanced sampling j9]. 



The last point can be circumnavigated if one can ensure identical and inde- 
pendent distributed (iid) sampling. This way, given enough samples, the training 
set will be sampled homogeneously from X. However, in most practical cases 
one can not give such a guaranty. 

The common procedure to validate y* is to withhold a test set §' £ 3?(X x 
T) , S' n S = from the training. A comparison of the normalized empirical costs 
of training and test set (called training and test error) demonstrates how well 
y* generalizes. If the errors are approximately the same, the optimal function 
predicts unseen inputs obviously as well as the training samples. However, if 
the test error is significantly higher than the training error, y* is probably over- 
fitting. Counter measures would include more training samples, a less complex 
function class or stronger regularization. 

Another useful test is the computation of y* for the same cost function and 
training data, but a different function class 3 r . The comparison of the test and 
training errors can tell us which function class is better suited. Especially if 
one can choose the complexity in a family of function classes (e.g. number of 
support vectors in kernel machines), it is interesting at which complexity the 
error saturates. Finding this point (i.e. the simplest model which estimates / 
well) is sometimes referred to as model selection or model comparison [9]. 



Algorithm 1 Linear Least Squares Regression 

Require: x W , . . . , x W £ IR P ; tW , . . . , tW 6 W 

Co = zeros (p,p) 
Bo = zcros(p,q) 
for £= 1, ... ,n do 

Bi = B 4 _i + x^t^ T 
end for 

W = inv(C„)B„ 
return W 
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2.1.5 Linear regression algorithms 

The least squares regression problem applies the squared Li norm and the class 
of linear functions on the empirical cost function fcq l2.1|) : 



C em P(y §) = J2 \\y( X UY) - t ^\\\ = ||W T X - T||| (2.7) 



where || • ||| is the squared Frobenius norm and the matrices Xy = and 
Tjj = t+ are introduced for simplicity. 

Extending the input samples by a constant (i.e. Xq = 1) allows us to use the 
same formulation for the class of afhne functions. 

To find the optimal function y*(x) = W* T a; we set the derivation of equa- 
tion [2?7] with respect to the parameter matrix W to zero: 

^g^l = 2X X-W-2XT- = 
aw 

=^W* = (XX T )" 1 XT T (2.8) 

which holds, providing the rows of X arc linearly independent and the covariancc 
matrix XX T therefore of full rank. 

We can also use the regularized cost function (eq. \2.2\i with the regularization 
term R(y) = R(W) = Atr(W T W) which penalizes the squared L2 norm of the 
column vectors of W. A £ 1R regulates how strong the penalty for complex 
functions is. 

9C " 9(W ' §) = 2XX T W-2XT T + 2AW = 
=> W* = (XX T + AI) _1 XT T 

Regression with this regularization term is known as ridge regression [5] or in 
the context of neural networks as weight decay [T]. 

Both algorithms have a complexity of 0{p 2 ) in space and max(0(np 2 ), 0(p 3 )) 
in time. The 0(p 3 ) term originates in the matrix inversion. 



Algorithm 2 Linear Ridge Regression 

Require: x^\ . . . , x^ e 1R P ; t (1 \ . . . , 6l';Ael 
C = zcros(p,p) 
Bo = zcros(p,q) 
for i = 1, . . . ,n do 
Cj = Q_i +x^x^ T 
B t = B 4 _i + x^t^ T 
end for 

W = inv(C„ + AI)B n 
return W 
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2.2 Kernel techniques 

Realistic processes can seldom be approximated sufficiently with linear function 
classes. Classes of nonlinear functions (e.g. neural networks) could provide sat- 
isfactory results, but their cost functions are rarely convex and the optimization 
therefore complicated. 

Another approach is the nonlinear expansion of the input samples x £ X. 
This way, projected into a high dimensional feature space, the problem can be 
treated as linear, while the solution in original space X is nonlinear. The classic 
example to demonstrate this is the XOR problem. 

In the XOR problem, the input x £ {0, l} 2 and the target t £ {0, 1} are 
connected by a simple rule: If both entries of x are the same then t = 0, 
otherwise t = 1. It is easy to see that there is no linear function y(x) = w T x 
that solves this problem, but if we expand the input vector x' = [x\, X2, XiX2] r , 
the parameter vector w = [1, 1, — 2] T explains the relationship perfectly. Thus, 
with the suggested expansion, the XOR problem is linearly solvable. 

Expansion almost always increases the dimensionality of the input. The fact 
that one normally does not know the perfect expansion beforehand rarely lead to 
feasible expansions that solve the problem. In the case that a suitable expansion 
size would outnumber the number of training samples, the kernel trick can be 
employed. 

2.2.1 The kernel trick 

The kernel trick defines the nonlinear expansion indirectly. One exploits the 
representer theorem, which states that the optimal function of a cost function 
can be represented as scalar products of the training set. 

Kernels are a broad class of functions, which are equivalent to a scalar prod- 
uct in a corresponding vector space. The intuition of the kernel trick is to 
exchange the euclidian scalar product by one in a high dimensional nonlinear 
feature space, i.e. another kernel. 

Choosing a nonlinear kernel is equivalent to a nonlinear expansion of the 
input vector. This way one can handle huge feature spaces. The drawback is 
that the optimal function depends on scalar products to all training samples, 
which can be infcasiblc for large sample sizes. 

In the following we introduce the elements of kernel methods used in this 
thesis. To follow the derivation path in full length, the reader is referred to [4]. 

Definition 2.7 ((Positive definite) kernel [4]) Let X be an nonempty set. 
A function k on X x X which for all n £ IN and all x^ x \ . . . , x^ n > £ X gives rise 
to a positive definite Gram matrix is called a positive definite kernel. 

For a definition of Gram matrix and positive definite matrix, see [4]. 

One can prove that every positive definite kernel function uniquely implies a 
scalar product {4>{x),ijj{x')) = k{x,x') with i/j(x) being the projection of a; £ X 
into another space. 
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To be more exact, one can show that a kernel function implies a reproducing 
kernel Hilbert space (RKHS) of functions / : X — > 1R with a scalar product. 
With a detour over Mercer kernels (which are equivalent to positive definite 
kernels) one can show that it is possible to construct a mapping ip(-) for which 
fc(-, •) acts as a dot product. 

In the following we will refer to an arbitrary kernel function •) = (?/>(■): 
in the corresponding RKHS JC. 

Theorem 2.6 (Representer Theorem [4|) Denote by R : [0,oo) — > IR a 
strictly monotonic increasing function, by X a set and by C : (X,IR, IR)™ — > 
IR U {oo} an arbitrary loss function. Then each minimizer y : X — > IR of the 
regularized risk 

C((s« , tW,y(xM)), (* (n \ t (n \ !/(*<">))) + R(\\y\\x) (2.9) 
admits a representation of the form 

n 

y(x)=J2aik(x^,x). (2.10) 

i=l 

Proof: See [4]. Note that a loss function is a slightly different defined cost 
function and regularized risk a cost function with a rcgularization term. 

Multivariate functions y : IR P — > IR 9 can be represented component wise 
Uj{x) = 5Zr=i Aijk( x i x )i giving rise to a new parameter matrix A £ R nxq : 

y(x)=A T k(x) (2.11) 

where k(x) = [k(x^\x), . . . , k{x^ n \ x)] T is called a kernel expansion of x. 

Remark 2.1 ("Kernel trick" |4j) Given an algorithm which is formulated in 
terms of a positive definite kernel k, one can construct an alternative algorithm 
by replacing k by another positive definite kernel. 

The kernel trick allows the replacement of scalar products (which are positive 
definite kernels) by complex, nonlinear kernels. One can take this replacement 
as a projection tp(-) into a high dimensional feature space. In this space, we can 
solve our optimization problem linear. The only restriction is that the original 
algorithm must be formulated entirely with scalar products. 
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Algorithm 3 Kernelized Ridge Regression 

Require: x^\ . . . , € IR P ; i«, . . . , *(") e IR 9 ; A e H; ft : IR P x W -> IR 
K = zeros(n,n) 
for j=l,...,ndo 
for j = 1 , . . . , n do 

Kij = fc(ajW,aj( J ')) 
end for 
end for 

A = inv(K + AI)T T 



2.2.2 Kernelized regression 

To demonstrate the kernel trick, we will derive a kernelized version of the linear 
ridge regression algorithm. 

First we have to notice that there are no scalar products in algorithm [2] 
Therefore, it is necessary to reformulate the algorithm [3]. Again, we start by 
setting the derivation of C reg with respect to W to zero: 

2XX T W - 2XT T + 2AW = 

-i(XX T W* -XT T ) =XA (2.12) 
A 

with A = -i(X T W* - T T ). Next we will clear A of its dependency on W*. 
Substituting (eq. I2.12[) in C reg and the derivation with respect to A yields: 

tr((A T X T X - T) 2 ) = tr((A T K - T) 2 ) 

2KK T A - 2KT T + 2AKA = 

(K + AI)"^ 7 " 

where the Gram matrix of scalar products K, 3 = x^ T x^ 6 H™ xn can be 
replaced by any other kernel matrix = k(x^' , x^'). Prediction follows 

y(x) =W T x = A* T X T x = A* T k(x) (2.13) 
The complexity of this algorithm is 0(n 2 ) in space and 0(n 3 ) in time. 



dC reg (W,S) 

aw 

=s> w* = 



C reff (A,S) = 
dCr e B(A,§) 
dA 
^A* = 
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2.2.3 Kernels 

Since the kernel implies the feature space, choosing a nonlinear expansion is 
equivalent to choosing a kernel. Therefore it is necessary to know common 
kernel functions and their properties. 



Polynomial kernels As we have seen in the XOR example, a polynomial 
expansion can be useful. The direct approach would be to collect all multivariate 
monomials up to degree d in a vector. For p dimensional input vectors, the 
expanded vector would have dimensionality I d+p r 1 1 = l - d ^ p ~ 1 )} . 



V d J - d\( P -i)\ 
The corresponding kernel function 

k(x,x') = {x T x' + l) d (2.14) 

projects into the same space of polynomials with degree < d. To demonstrate 
this let us consider the following example [9]: The input vectors shall be x, x' £ 
H 2 . The polynomial kernel function of degree 2 for those two is: 

k(x,x') = (x T x' + lf 

= (xiXi + X2X2 + l) 2 

= (xix'i) 2 + (x 2 X 2 ) 2 + 2x\X\XlX 2 + 2xix[ + 2x2X 2 + 1 

= \x\,x\,\Plx\X2,\F2.x\,\Plx2,\\ [(x^) 2 , (x' 2 ) 2 ,V2 x[x' 2 ,V2 x[,V2 x' 2 , l] T 
:= i[>(x) T %l>(x') 

The projection ip(x) defined this way spans the space of polynomials of degree 
2 and has therefore dimensionality 6. 

Polynomial kernels are all about the euclidian angle between two inputs. 
Additional, the euclidian length of both vectors play a role. If one likes to 
get rid of the last effect, a normalized polynomial kernel returns only the cosine 
between x and x' to the power of d: k(x, x') = { t3t\ \\ X '\\ ) d - ^ e induced feature 
space is of lower dimensionality, but the kernel reacts less extreme to large input 
vectors. 



RBF kernels A radial basis functions (RBF) kernel has the general form of 

fe(*,*0 = e^(-(^|^) d )- (2-15) 

Different norms || || and parameters d and a generate different kernels, which 
all imply an infinite dimensional RKHS [1] . The two most popular both use the 
Z/2 norm and are called Laplacian (d = 1) and Gaussian (d = 2). The latter 
is by far the most common kernel and has become synonymous with the name 
RBF kernel. 

RBF kernels are based on the distance of two input vectors. The kernel 
parameter a determines the radius of influence for training samples. In well 
sampled regions (and well adjusted a) the kernel shows good approximation 
properties but where no training data is available the kernels will produce only 
small output and therefore perform poorly. 
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2.2.4 Kernelized covariance eigenvalue problem 

Many unsupervised learning algorithms (e.g. PC A [8]) demand an eigenvalue 
decomposition of the covariance matrix: 

cov{X) = E[xx T ] - E[x]E[x] T = -XX T - 4jXll T X T = V c A c Vj (2.16) 

In most cases one aims to project samples x onto eigenvectors U c : x' = Vjx. 
To catch more complex, nonlinear relationships, one can expand x into a non- 
linear feature space ip(x). The kernel trick can help to describe the expansion 
more efficient (e.g. Kernel PC A [IT]). 

Original approach One starts with ensuring zero mean by subtracting the 
sample mean from all columns X c = X — i-Xll T . Subsequently, one can per- 
form the eigenvalue decomposition on the centered covariance matrix ^X c Xj = 
U c A c Uj. Standard implementations of eigenvalue decompositions have a com- 
plexity of 0(p 3 ) with p being the dimensionality of samples x. 

Inner and outer product [8] If we want to reformulate the covariance ma- 
trix eigenvalue decomposition, we can exploit a relationship between the inner 
product X T X and outer product XX T of X. 

The singular value decomposition X = USV T with £ = [A 1 / 2 0] refers 
to eigenvalues and eigenvectors of both outer and inner product. Therefore, 
an eigenvalue decomposition of the inner product X T X = VAV T produces 
the same non zero eigenvalues A r as the outer product and the corresponding 
eigenvectors U r and V r are related by: 

U r = XVrA- 1 / 2 (2.17) 

Kernelized approach We proceed as before. With the Gram matrix of scalar 
products K = X T X at hand, we first have to center the data represented by it. 
Subtracting the sample mean X c = X — iXll T , it is easy to proof that the 
centered kernel matrix is: 

K c = XjX c = (I-ill T )K(I--ll T ) (2.18) 

n n 

Secondly, we perform the eigenvalue decomposition K c = V c A c Vj. Using 
(eq. I2.17[) we can now express the term Ujx with the nonzero eigenvalues A r 
and corresponding eigenvectors V r of K c : 

Ujx = VnK 1/2 V T X T a: (2.19) 

Replacing K by another kernel matrix K and the term X T a; in (eq. 12 . 19[) by 
k(x), we implicitly project the data into a feature space ip(x) defined by the 
kernel: 

\5jip{x) = A" 1 / 2 Vjk(x) (2.20) 
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2.2.5 Kernel matrix approximation 

The application of the kernel trick is restricted to a small amount of samples 
(n w 5000 ... 10000), because the kernel matrix stores nx n entries. If this 
number is exceeded, one would like to sacrifice approximation quality for man- 
ageable kernel matrix sizes. Different approaches have been made to overcome 
this problem by means of an approximation K € ]R mxm of the kernel matrix 
K £ WL nxn , where m << n (see for example JT0] ) - They all have in common 
that they assume a given subset of the training samples (support vectors) to be 
is representative for the whole set. 

An algorithm to select support vectors is presented in section l2.2.6l 

Subset of Data |10| The simplest approach to kernel matrix approximation 
is called subset of data (SD). In this approach only the subset of m support 
vectors is used and the approximation therefore has size m x m. Because, only 
the support vectors influence K all information of the remaining n — m samples 
is lost. 

If the support vectors are chosen randomly out of the training set, subset of 
data is also known as the Nystrom method |12j . 

Projected Process |10j A better suited approach is called projected process 
(PP). Here, the non support vector rows are removed but all columns arc kept 
and the resulting matrix K has size M mxn . The matrix KK T G IR mxm is used 
to approximate K 2 . While preserving much more information, the method can 
only be applied to algorithms which use K 2 . 

This restriction can be circumnavigated, if an eigenvalue decomposition of 
K has to be performed anyway. Because K 2 = VA 2 V T for K = VAV T , it is 
sufficient to perform the eigenvalue decomposition on KK T . Taking the square 
root of the eigenvalues A, together with the unchanged eigenvectors V, yields 
the projected process approximation of K « VA 1,/2 V T . 

2.2.6 Support vector selection 

Whatever approximation method one chooses, the choice of support vectors is 
crucial to preserve as much information as possible. Selecting an optimal set of 
support vectors means to minimize a specific cost function with respect to the 
set of support vectors, which is a hard combinatorial problem [7]. 

A number of heuristics have been proposed to find a suitable set. Beside the 
purely random Nystrom method |12j . the sequential sparse Bayesian learning 
algorithm [5] (related to the relevance vector machine [5]) estimates the contri- 
bution of training samples to the optimization problem. The procedure works 
iteratively, but not greedy. For big training sets, the convergence is very slow. 
However, the method is specified for a Bayesian setting (like Gaussian processes 
[9], for example) and therefore not suitable for our purposes. 
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Algorithm 4 Greedy support vector selection algorithm 

Require: v, x^\ . . . , i'"' 
SVi = {x^} 

for i = 2, . . . , n do 
a t = K^fe^W) 
ei = k(x^,x^) ~ k(x^) T ai 
if e, < v then 

SV< = SVi-iU{x®} 



K7 1 



a,a„- 



-a. 



1 



else 

Kr 1 = Krii 

end if 
end for 
return SV n 



In this thesis, we want to investigate another heuristic, which was proposed 
by Csato and Opper [Tj5] and applied to a kernelized version of least square 
regression by Engel [33] . 

We assume the data x to be mapped into a feature space ip(x), defined im- 
plicitly by a kernel k(x,x') = ip(x) T ip(x'). Theorcm l2.6l (representor theorem) 
assures that the optimal function is a linear combination of scalar products with 
training samples. If two expanded training samples would be linear dependent, 
obmitting one would not change the optimal function. Likewise, those samples 
which can be approximated badly using linear combinations of the remaining 
set are likely to be important. 

The idea of algorithm |4] is that for a given set of support vectors SV = 
{x^\ . . . , a;*-" 1 '}, we define the approximated linear dependence (ALD) condition 



e(x) 



< V 



(2.21) 



where v is an accuracy parameter determining the level of sparsity. The term 
e(x) describes the minimal distance to the affine hull of set SV. It serves as an 
importance measure, which is independent of the problem we want to optimize 
afterwards. 
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When we denote the kernel matrix K.^ = k{x^\x^') and the kernel ex- 
pansion k(x) = [k(x^\ x), . . . , k{x^ m \ a;)] T , we can express the left side of 
fog. 12.21]) as e{x) = min a a T Ko — 2a T k(x) + k(x,x) := min a L(x, a). The 
derivation with respect to a yields: 

dL(x, a) „, , , i 

= 2Ka - 2k(x) = 

oa 

^>a* = K^kix) 
which allows us to reformulate the ALD condition (cq. 12.211) : 

e(x) = k(x, x) - fc(a;) T K" 1 fc(a;) < v (2.22) 

All training samples x for which (eq. I2.22j) holds are considered "well ap- 
proximated" and therefore omitted j!4| . 

More precise, algorithm 0] runs sequentially through the training samples 
and tests if the current sample scW fulfills the ALD condition for the current set 
of support vectors SVi-i. If the condition fails, the sample becomes a support 
vector and the inverse kernel matrix (which is guaranteed to be invertible for 
v > 0) is updated using the Woodbury matrix identity [5]. 

The overall complexity of the algorithm is 0(m 2 ) in space and 0(nm 2 ) in 
time. 

The support vector set obtained by this method should be well-conditioned 
for most optimization problems. It also has the nice property that for every v 
there is a maximum size for SV n , which is independent of n. 

Proposition 2.7 Assume that (i) k is a Lipschitz continuous Mercer kernel on 
X and (ii) X is a compact subset ofTR d . Then there exists a constant C that 
depends on X and on the kernel function such that for any training sequence 
jj-gM joo^ and for any v > 0, the number of selected support vectors N, satisfies 
N < Cv~ d . 
Proof: See [14]. 



As a downside, the number of selected support vectors m depends largely on 
input set X and kernel £;(■, ■), There is no way to estimate m from the parameter 
u 1 which is responsible for the sparsity of SV n . In practice one is interested in 
a support vector set of an appropriate (large but not too large) size. The 
experiments conducted in chapter [5] used RBF kernels and kept a fixed u = 0.1 
while varying the kernel width a. This procedure was repeated until a set of 
suitable size was found. 
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Chapter 3 

Reinforcement Learning 



In this chapter we will give an outline how to solve control problems (like 
robot navigation) with reinforcement learning. First, we will define the nec- 
essary terms of Markov decision processes (sec. !3.lTTj) , followed by a discussion 
of state space representation (sec. I3.1.2p . especially trigonometric polynomi- 
als (sec. 13.1. 3p . Secondly, we will discuss linear algorithms for value (LSTD, 
sec. 13.2. ip and Q-value estimation (LSQ, sec. I3.2.2[) . Finally, we introduce 
the principles of policy iteration and present the least squares policy itera- 
tion (sec. I3.3[) followed by a summarizing conclusion of the complete process 
(sec.EU). 

3.1 Introduction 

Learning behaviour for an autonomous agent means learning to choose the right 
actions at the right time. In machine learning, this is called a policy: A function 
that chooses an action based on some input representing the state of the world. 
In human terms, to define a policy we have to define two things first: The 
perception and the goal of the agent. The first defines what we see as a "state" , 
the second what we see as " right" . 

While perception of the world is an open field which is mostly circumnav- 
igated by defining some "essential variables" like position and orientation, be- 
havioural experiments [classical conditioning, for example Pavlov's dog bell) 
lead to a simple approach to the second question: Reward and punishment. 
In short, an agent should try to maximize his future reward while minimizing 
future punishment. Thus, training the agent becomes a simple choice what it 
shall and shall not do, eliminating the need to show it how. 

Technically, we assume to have access to some function that filters meaning- 
ful (discrete or continuous) states out of the sensor data, available to the agent. 
The exact nature of this function is of little importance to this chapter, besides 
that it provides the full state of the world. In this thesis, we want to show 
that Slow Feature Analysis (SFA, chapter 2]) can learn such a function under 
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the right conditions. 

The evolution of state and reward over time is modelled as a stochastic 
Markov decision process (MDP). Since a critical choice leading to reward or 
punishment might be necessary long before its fruits will be received, we can not 
use standard regression approaches. Instead we try to estimate future reward 
(punishment is simply negative reward) for all actions in a state and take the 
action which is most promising. The sum of the expected future reward is called 
value (sec. I3.2|) . 

Unfortunately, the future (and with it the value) we are trying to predict for 
the policy, depends on the policy itself. Since this is a "hen or egg" problem, 
policy iteration methods repeat the steps value estimation and policy determi- 
nation until they converge to the optimal policy which maximizes the value for 
all states (section [33]) . 

3.1.1 Markov processes 

After learning, we wish for the agent to move towards the reward. To do this, it 
has to learn which transition exist between a given set of states and where the 
reward is given. Once this knowledge is somehow established, it needs to find a 
decision function, telling it where to choose which action. 

As we will see, the central element is the estimation of the sum of expected 
future rewards, called value. Value estimation is easiest in Markov reward pro- 
cesses (MRP), so we will start by introducing them to the reader. The MRP 
itself is unable to model control tasks, but there exists an extension named 
markov decision processes (MDP). MDP models the element of choice by intro- 
ducing a set of actions between which a policy can choose. We will finish this 
section by defining the optimal policy, a function which chooses the action that 
will maximize the future reward. 

Definition 3.1 (Markov reward process MRP) The discounted MRP 
M = (S, P, R, 7) consists of a set of n states S = {si, . . . , s n }, the transition 
probabilities P : S x S — > [0, 1] with Vs G S : Y17=i ^ > ( s ' s ») = I7 a function that 
assigns a real valued reward to every transition R : S x S — > JR, and a discount 
factor 7 <G (0, 1]. 

MRPs are used to describe the statistical properties of the random walk of a 
variable x € S in discrete time steps and the reward it collects in the meantime. 
In opposition to iid drawing, the random walk is subject to the Markov property 

P(x t \x t -i, ...,x ) = P(x t \x t -i), (3.1) 

in other words the probability of being in state x% € S at time t depends 
exclusively on the predecessor state Xt-i € 5*. In most situations reward depends 
solely on the outcome, so we simplify in the following R : S — > ]R as only 
dependent on the target state of a transition. 
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Figure 3.1: A deterministic 16 x 16 states MRP (left) and the resulting value 
for 7 = 0.9 (right). Arrows indicate the transition direction, entering G is 
rewarded. 



Definition 3.2 (Value |22j ) The value V : S — > IR, is the expected sum of 
future rewards, discounted by the factor 7 G (0, 1]: 



V{x) = E 



(3.2) 



Due to the infinite sum, one commonly uses a recursive definition which is known 
as the Bellman function |22j : 

V{x t ) = E [R(x t+1 ) + jV{x t+1 )} = P(xt,x')(R(x') + jV(x')) (3.3) 

x'es 

Intuitively, the value function V(x) tells us which states are promising (figure 
13.11) . However, it does not say how to get there. For control purposes, we would 
need a function that returns the value of all possible actions in the current state. 
Within the framework of Markov decision processes, we can define exactly such 
a function, called Q-value. 

Definition 3.3 (Markov decision process MDP |27J) The discounted 
MDP M = (S, A, P, R, 7) is an extension of the MRP to control problems. It 
defines an additional set of m actions A = {ai, . . . , a m } and transition proba- 
bilities F : S x i x 5 4 [0, 1] and reward function R : S x A x S — > JR are 
extended to depend on the executed action as well. 

As in the MRP case, we simplify the reward function as exclusively depen- 
dent on the target state: Vs € S : R(s) = R(-, •, s). 

Definition 3.4 ((Stationary) Policy [27]) A function n : S x A -> [0, 1] 

with Vs € S : ^2 7r ( s > 0,) = 1 is called a stationary policy. 



A policy represents a control or decision automat. n(s, a) is the probability 
that the automat chooses action a in state s. Since the value in a MDP depends 
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Figure 3.2: The Q-values of a 16 x 16 MDP with 4 actions and the left of figure 
13. II interpreted as a deterministic policy. Arrows indicate to which of the actions 
(movement directions) the Q-values belong. Note that in front of the goal, the 
Q- value of every action is highest. 



on future decisions, it is necessary that the policy is stationary (i.e. does not 
change) throughout value evaluation. 

Definition 3.5 (Q- value [27]) The Q-value : S x A -> TR of an MDP is 

the value of an action executed in a state, if all following actions are selected 
with respect to the stationary policy tt: 

Q 7r (xt, a t ) = E[R(x t +i) + jQ'(x t +i, Ot+i) | x t +i ~ P(x t , a t , ■), a t +i ~ 7r(a;t + i, ■)] 

= ^2P(xt,at,x')(^R(x')+-f^2iT(x',a')Q K {x',a')) (3.4) 
x'es a'eA 

The Q-value function defines implicitly a value function for every action (see 
figure [3"?2"j) . This way one can navigate by always choosing the action with the 
highest Q-value. 

Of course, we can also reformulate the classical value function (cq. 13. 3[) . 
which is now dependent on policy tt, in terms of Q-values: 

V"(xt) = ^<ir(xt,a)Q"(xt,a) (3.5) 

Lemma 3.1 For a given policy tt, the value of a MDP M = (S, A, P, i?, 7) is 
equal to the the value of the MRP M' = {S, P' , R, 7) with Vs, s'gS: P(s, s') = 
Z)kLi ^O, a k )P{s, a k , s'). 

Proof: Insert (eg. I3.4[) in (eg. 13. 5p . Together with the definition of P'(s,s') 
one can derive (eq. I3.3p . □ 

Definition 3.6 (Optimal policy) The optimal stationary policy tt* maximizes 
the value V ir (s) for every state seS: 

W,VseS:V v *(s)>V*(s) (3.6) 

This is obviously the control function we were looking for. A control algorithm 
simply has to draw the actions according to tt* and is guaranteed to find the 
best way to the reward. Lemma 13.21 shows how to define such a policy with the 
help of Q-values. 
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Lemma 3.2 (i) There can exist several n* but (ii) at least one is equivalent to 
a deterministic selection function a* : S — > A with Vs € S : 7r*(s,a*(s)) = 1: 

a* (s) = arg max (s,a) (3.7) 

Proof: (i) Consider oi , a-2 € A such that max Q w (s, a) = Q 71 '(s,ai) = (^(s, 02). 

a 

For every A € [0, 1], the policy ir(s, ai) = A, 7r(s, 02) = (1 — A) is optimal. 

(ii) Due to (eq. [331) and (eq. I5T7|>. Vs € 5 : F*(s) = max Q 7r (s, a). Therefore, 

a 

7T* is optimal. □ 

Of course, the Q-values them self depend on the policy and the only way to 
solve this dilemma is to improve both quantities in an alternating fashion (see 
policy iteration, section f3.3[) . 

3.1.2 State representation 

The MDP formalism provides us with a (possibly infinite) set of states S. At this 
point we are interested in value estimation, that means we want to find a func- 
tion v : S — > 1R that at least approximates the value for all states. If we restrict 
oursclf to linear functions, i.e. v(x) = w T x, we need a linear representation 
4> : S — > H m to project the states into a subset of IR™. 

Definition 3.7 (State representation) Let X be an arbitrary set. An infec- 
tive function <f> : S — > X will be called a representation of the set of states S . If 
X C ]R m for m G IN + , </> will be called a linear representation. 

We aim here to find a representation that works with linear functions, i.e. 
V n (s) « v n (4>(s)) = w T cf>(s), in other words a linear representation. The 
injectivity of <f> guarantees that no two states have the same representation. 
Obviously the representation need additional properties to approximate a value 
function well. However, if S is a finite state space, there exists a representation 
that allows the exact approximation, i.e. V n (s) = v n ((f)(s)). 

Definition 3.8 (Tabular representation |9j) The linear representation 

0:5—)- {0, 1}" C H" is called a tabular representation of S = {si, . . . , s n }, if 
Vi € {1, . . . , n} : <pi(si) = 1 and Vj ^ i : <fij( s i) = 0- 

Lemma 3.3 The value function of every MRP or MDP can be exactly approx- 
imated as a linear function of a tabular representation (def. \3.8\) . 

Proof: The linear function v v (s) = w T <f>(s) with Vi : Wi — V^Si) represents 
the value P(s) exactly, since Vi : v(si) — Y^j=i w j ( l ) j{ s i) = w i = V^( s i)- O 

The size n of this representation is equal to the number of states. Since the 
value estimation algorithms we will discuss later have a complexity of 0(n 3 ), 
this is not feasible for larger problems. Especially if we have a continuous, i.e. 
infinite state space, we are not able to approximate the value function exactly 
anymore. 
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Continuous state space Due to the infinite number of states, we speak of 
the transition probability kernel P(s,a, •) and a reward function R(-) |33j . 

Instead of giving a formal derivation of continuous state spaces, we will 
restrict us to subsets of the euclidian space. Other spaces are imaginable, but 
euclidian subsets should cover most practical cases, from robot positions to 
pressure intensity. 

Assumption 3.9 (Compact subset |33j ) The continuous state space S is a 
compact subset of the d dimensional euclidian space. 

For example, the closed set of tuples that form the unit square [0, l] 2 is a compact 
subset of the 2-dimensional euclidian space. 

The approximation quality of a linear function will depend entirely on the 
state representation <f> and the value function at hand. The value will probably 
be continuous on most parts of a continuous state space, so it makes sense to 
choose a set of basis functions that are known to approximate continuous func- 
tions well. Examples are polynomials, splines |19| and trigonometric polynomials 

p]. 

3.1.3 Trigonometric polynomials 

In this thesis, trigonometric polynomials will play a major role, so we shortly 
introduce them here and discuss their approximative capabilities. 

Definition 3.10 (Trigonometric polynomial [20j ) A trigonometric polyno- 
mial T : 1R — > IR of degree d with the coefficients clq, . . . , ad and b\, . . . , 6<j is 
defined as: d d 

T(x) = do + ajcoslflnjx) + bjsin{2-Kjx) (3-8) 

3=1 3=1 

Maybe the most famous trigonometric polynomial (of degree d — > oo) is the 
Fourier series [20] . Trig, polynomials of degree d are a linear combination of 
the first 2d + 1 trigonometric basis functions ip(x): 

Vie]N:^) = {Xgr^)'odd ( 3 - 9 ) 

So we can express every trig, polynomial of degree d as T(x) — Y^=o Wiipi(x). 
However, most continuous state spaces will be multidimensional. 
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Definition 3.11 (Multivariate trigonometric polynomial) A multivariate 
trigonometric polynomial T : IR P — > 1R of degrees di,...,d p in the respective 
input dimensions, with the coefficients w € TR m and m = IIn=i(2^p + l) * s 
defined as: 

2d! 2d p p 

T(x) = ... g W fl (n u ... < n p )t[*PnM (3.10) 

ri\— rip— k— 1 

VIP \ 

fi(ni, . . . ,n p ) = I [ (2(^ + 1) , (index function) (3.11) 

»=i y j=»+i y 

Intuitively spoken, the multivariate trig, polynomial is the weighted sum of all 
combinations of trigonometric basis functions in all input dimensions. 

Definition 3.12 (Trigonometric representation) The linear representation 
<p : S — > IR m of the continuous state space S — [— 1, l] p C TRF called trigonomet- 
ric representation with degrees d\, . . . , d p and m — J]^ =1 (2dp + 1) is defined as: 

p 

Vi e : Vnj € {l,...,dj : ^ (ni ,...,„ p) (as) = JJ^*^) ( 3J2 ) 

fc=i 

Thus, we can express every multivariate trig, polynomial as a linear function 
with a trig, representation of the same degree: T(x) = w i4>i{ x )- 

Theorem 3.4 ([18j, Thm 4.25) For every continuous function f : W — > 1R 
and every e > 0, there is a trigonometric polynomial T(x), such that Vx S 
[-l,l?:\f(x)-T(x)\<e. 

Proof: See [H]. 

This theorem states that we can approximate any continuous function point 
wise arbitrary well as a multivariate trigonometric polynomial. 

However, since we are initially unaware of the the value function, we have to 
pick the degrees before the estimation. This leaves us with the question how big 
the estimation error will be for given degrees di,...,d p . One can expect that it 
is related to the smoothness of the value function at hand, which strictly does 
not even have to be continuous (e.g. at walls). See Lorentz [T7] for details. 
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3.2 Value Estimation 

In this section we want to present algorithms to estimate value and Q-valuc 
functions. Instead of giving a complete overview of the topic, we want to restrict 
us to model free approaches, because for control purposes we are only interested 
in the Q-value, anyway. 

We understand under model free value estimation the absence of an explicit 
model of the MDP, i.e. P(-, •, •) and ir(-, •). Also, the state space S and reward 
function does not need to be known. However, we explicitely need to know 
action space A and discount factor 7. Of course, there also exist explicit model 
approaches for value estimation. For example in [34| independent models of the 
transition matrix and the reward function were learned. It is also shown that 
for linear models their approach yields exactly the same solution as a model free 
approach. 

Because of these restrictions, we can only minimize the Bellman error |34) 

BE{V) = ¥< t [Rt+lV(.x' t )-V{x t )] (3.13) 

over a set of transitions xt x' t with reward Rt- 

In this thesis, we only consider the case of linear function classes as esti- 
mation model. The standard approach is called temporal difference learning. 
Classic algorithms like TD(X) and Monte Carlo [21] are left out for the benefit 
of least squares temporal difference algorithms. These circumnavigate a number 
of problems of classical algorithms we can not go into detail here. 

3.2.1 LSTD 

The least squares temporal difference (LSTD) algorithm estimates value func- 
tions of MRP and was first developed by Bradtke and Barto [52]. Later, in line 
with the extension of TD to TD(X), the algorithm was extended to LSTD(A) 
by Boyan [24] . We will discuss the original LSTD algorithm, but in a formalism 
that resembles Boyans extension. 

LSTD is defined on MRP, so we assume a training set 7 = {(<j>(xt), rt, (/>(x' t ))}f =1 
C (S X IR X S) of n transitions Xt — > x' t with reward r t . The sampling of x t 



Algorithm 5 LSTD 

Require: 7; {(</>&), Ri, </>(^))}™ =1 C (R m x IR x IR"') 
Ao = zeros(m,m) 
bo = zeros(m,l) 
for £= 1, ... ,71 do 

A, = Ai_i + <t>{xi){4>(xi) - 7</>K)) T 

bi = bi-i + (j){xi)Ri 
end for 

w = pinv(A„)6„ 
return w 
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affects the solution and should be as uniform as possible. The states can be 
represented by an arbitrary injective function <f> : S — > IR m . 

For readability, we define the matrices 3?^ = cj>i(xj) and <I>^ = 4>i(x'j). 



Optimization problem LSTD minimizes a cost function related to the Bell- 
man error 1221: 



C lstd (w,($,&,R)) = 



E, 



{x t )(v*{x t )-{R t+1 V"{x' t ))) 



- ||$(*-7*') t ™-*-R||f 



2 
F 

(3.14) 



with the linear function V n (x) = w T 4>(x) and E 4 [-] being the empirical mean. 
Deriving (cq. 13 . 14[) with respect to w yields the minimum w* : 



dC lstd {w, (*,*', R)) 
dw 



2 2 i 
-$ 1 $ T $$> &^ T G>R = 

n n 
(* 7 $ T *$^)" 1 * 7 * T *J [ ? 

(*(* -7*') T ) f *-R 



where 3> 7 = 3? — 7$' and f denotes the pseudo inverse. The last line holds if 
the rows of €>«i? 7 are linearly independent [24] . 

LSTD (algorithm [5]) has a complexity of 0(m 2 ) in space and 0(m 3 ) in time 
with m being the length of the state representation. 



3.2.2 LSQ 

The least squares Q-value (LSQ) algorithm was introduced by Lagoudakis et 
al. [35] to extend LSTD to Q- values. The basic idea is an extension of the state 
representation 0:5—^ JR m to state action representation <fi : S x A — >• TR m . 

Definition 3.13 (State action representation |26j ) An arbitrary linear rep- 
resentation <p : S — > m™ (e.g. of a continuous state space) can be extended 
by a discrete set of actions A = {ax, . . . ,a,k} to a combined representation 
4> : S x A -> WC nk with Vs g S,Vai G A : (f>(a,Oi) = [zj , <^>(s) T , zJ] T , 
z a G {0} (l " 1)n and z b € {O}*" 1 " 8 '". 



Algorithm 6 LSQ 

Require: <p : S x A ->• H Tn ; 7; {a*, a i; i? i5 x-, a£}™ =1 cfSxixlxSx^) 
for i = 1, . . . , n do 

4>i = 4>{xi,ai) 

end for 

«=LSTD( 7 ;{(&,ifc, 
return w? 
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Formally, we assume a training set 7 = {(xj, Oj, Ri, x\, aQj-^Lj 01 n observed 
transitions Xi -4 x^, the collected reward Ri and the next action o! i chosen 
according to n. The states might be available in an additional representation, 
but they serve only as input for <f>(-, •). 

Proposition 3.5 Given a stationary policy tt and a state action representation 
4> : S x A -> S', the Q-value Gf{s,a) of the MDP M = (S,A,P,R,j) is equal 
to the value V'{<j>{s,a)) of the MRP M' = (S',P',R',j), where 
Vs,s' G S*,Va,a' G A : 

(<) S' = U U U(s,a)} 

s£S a£A 

(ii) P , y>(s,a),<f>(s',a')) = P(s,a,s')n{s',a') 

(Hi) R(4>(s, a), </>(s', a')) = i?(s, a, s ) 
Proof: VseS.VaeA 

Q n {s,a) = ^ P(s,a,s')^(s,a,s') +7 ^ 7r(s / ,a')Q' r (s',a')) 

s'6S a'SA 

= ^(s'- a ')P( s , a i s')R(s, a, s') + jP(s, a, s')ty(s', a')Q K (s', a')^ 

s'eS a'GA 

= E E P'(4>(s,a),4>(s',a'))(R(<b(s,a), ( t > (s',a')) + 1 V'(<b(s',a'))) 

s'£S a'€A 

= V'(^(a,a)) 
The last equality holds only if <fr(-, •) is injective. □ 

LSQ combines the state and action using a injective function </>(•,•) (e.g. 
dcfinition l3.13[) . and then utilizes LSTD to estimate V'((j)(-, ■)). Since no explicit 
model of the MDP is needed, estimating V' (</>(■, •)) is & n easy way to estimate 
Q 7r (-,-). The policy affects the solution through a\ ~ w(x' i ,-), which satisfies 
definition 13.51 (Q-values) . 



3.3 Policy Iteration 

With lemma [3~2l we have found an intuitive way to realize the optimal policy by 
means of Q-values. These can be estimated model free by the LSQ algorithm 
(sec. 13.2^2")) . However, the training set is sampled using a (not necessary known) 
policy and LSQ needs a policy to choose a' . 

A solution to this problem is called policy iteration and consists of a sequence 
of monotonically improving policies 7To, • • • , ir m of the form |21| : 

ati(s) = argmaxQ"* (s, a) 

a 

Vs G S : 7Tt+i(s, Qfj(s)) = 1 (3.15) 

The evaluation of 7r m obviously requires an iterative alternation of Q-value es- 
timation of policy 7Tj and policy improvement (eq. 13. 15[) to achieve a so called 
greedy policy 7r,+i. For the initial policy 7To, one often chooses a random policy 
with equal probabilities for all actions. 
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Algorithm 7 LSPI 

Require: 4> : S x A -> M Tn ; 7; e 

Generate {xi,ai}™ =1 C S x i uniformly distributed 
Measure all and in experiments based on xi and 
Generate all a\ uniformly distributed for initial random policy 
/ / Policy iteration 
w = 00; w' = 
while 1 1 ic — w'\\ > e do 
w' = w 

w = LSQO; j;{x l ,a l ,n,x' l ,a' 1 }2 =1 ) 
for i G {1, . . . , n} do 

a\ = argmax(iB T </)(j;^, a)) 

a 

end for 
end while 
return w 



In the case of tabular state representation it is possible to prove the mono- 
tonic improving property of policy iteration. Approximations of Q n , however, 
complicate convergence proofs and depend on approximation quality and Q- 
value estimation algorithm |27j . 

3.3.1 Sampling 

For estimating Q-values, we need a training set 7 = {(a;,, a,, r^, x[, a^)}" =1 C 
SxixlxSxi. All x- — P(a:i, a i; •) and n ~ a*, •) follow the MDP 

and all a' ; ~ ^-(o;^, •) are drawn according to the current policy. The start states 
Xi and actions at, however, have to be sampled by another process. 

Most experiments are performed in trajectories. For every trajectory of 
length I G IN + we draw a start state according to some distribution x' ~ s(-) 
and follow a stationary policy n, i.e. Vi G {1, . . . , 1} : 

However, this approach can lead to some obstacles that can ruin the conver- 
gence of policy iteration. We will demonstrate the problem on a simple version 
of the Sarsa algorithm [53] and the solution with the LSPI algorithm [55] . 

Sarsa The easiest approach is to record a number of trajectories with a uni- 
form start distribution s and to follow the current policy, i.e. tt = tt^. 

Since equation 13.151 is based on the maximum of Q 7T (s, ■) in a state s G S, 
it is necessary that Q 7r (s,a) is approximated well for all a G A. This is not the 
case here, since every policy but the first is greedy. Sarsa will therefore choose 
the same action every time it visits the same stat^E This will lead to a poor 
approximation of all but the currently preferred action and therefore to poor 
policy convergence. 

1 To be precise, in |23| Sarsa chooses actions e-greedy, i.e. with a (iteration dependent) 
probability of e greedy and with (1 — e) random. 
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Algorithm 8 Control 
Require: 4> : S x A -> H Tn ; w 
while true do 

x = observe() 

a = argmax(w;* T 0(x, o)) 

a 

execute (a) 
end while 



LSPI It is therefore important to draw et^ as well as Si randomly, uniform 
distributed at best |27| . One way to achieve this is to record an initial dic- 
tionary with a random policy tt and change only a' between the iterations. 
This approach is called least squares policy iteration and ensures equally well 
approximated Q-values. 

On the one hand, this approach eliminates the need for resampling every 
iteration, which reduces the experimental overhead. On the other hand, there 
is no way to compensate unbalanced sampling in the dictionary. A random walk 
constrained by a complex environment can be very erratic, especially with few 
but long trajectories. Optimally, one would record trajectories of length 1 with 
s and tt being uniform distributions over S and A, respectively. 

Algorithm [7] shows the complete LSPI method, for more information see [55] 
and [27]. 

3.4 Conclusion 

In this chapter we presented an outline how to solve control problems with 
reinforcement learning. 

• Algorithm [8] shows how the control works after an optimal policy, repre- 
sented by the parameter vector w* , is obtained by LSPI. 

• For LSPI (alg. [7]), it is sufficient to record one random walk to train the 
complete control procedure. To fulfill LSPIs requirement of uniformly 
distributed start states and actions, one can rely on the original random 
walk properties (given enough examples) or sample a balanced set out of 
the video to create a dictionary. 

• Using LSQ (alg. [6]) and LSTD (alg. [5]) we can estimate the Q- value without 
knowledge of the underlying MDP. Only some samples in the dictionary 
have to be labeled as reward or punishment. 

• The states can be given in any representation, if we use definition 13.131 
to generate state action representations. However, a representation suited 
for linear functions is recommcndablc, e.g. a trigonometric representation 
(def.|332J|. 



Chapter 4 

Slow Feature Analysis 



In this chapter we want to introduce an unsupervised learning criteria called 
slow feature analysis (SFA, sec. 14.1. SFA is used to find continuous states in 
temporal data. As a big advantage, theoretical tools are available that predict 
the behaviour in a variety of situations (sec. 14.1. 2[) . We give an overview of 
recent applications in section 14.21 and conclude with the commonly used linear 
SFA algorithm (sec. 14.3. 1[) . Since linear SFA is not suited for the task in this 
thesis, we also develop a kernelized version of the linear SFA algorithm (sec. 

4.1 Introduction 

Biological systems start with little hard-coded instructions (e.g. genes). They 
grow, however, far more complex over time. It is estimated that the human 
genome consists of approximately 25.000 genes; the human brain alone is com- 
posed of 10 11 neurons, each with its own individual properties. Therefore, bio- 
logical systems must have a way of self-organizing and diversification, depending 
on the environment. 

To understand and reproduce this lcarning-without-a-tcachcr, the field of 
unsupervised learning was developed. In contrast to supervised learning (e.g. 
regression, section T2.ip no target value is available, thus the design of the cost 
function alone determines the optimal function. Since there is no explicit target, 
every optimization problem has to rely on a self-organizing principle. 

Temporal coherence The underlying principle of slow feature analysis are 
temporal coherent signals. 

Imagine a sensor of undetermined type. All we can see are the multidimen- 
sional sensor readings, which probably are distorted by some kind of uncorre- 
cted noise. Whether the sensor itself is moved or the environment changes 
around it, the readings will vary over time. We are searching for a filter which 
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applied on the sensor readings yields a meaningful representation of the current 
state of environment. 

As shortly addressed in the last chapter (sec. I3.ip . there is no mutual consent 
what exactly a state of the environment is. The approach taken here is to 
concentrate on the things changing over time. For example, if an object is 
moved in front of a static background, than the background (as complex as it 
might be) is completely irrelevant for the state. The object, however, is equally 
unimportant. Only its position varies over time, and is therefore the state of 
the environment. 

Keeping in mind that noise is considered uncorrelatcd to the state and there- 
fore not markovian, a filter specializing on noise will change quickly. On the 
other hand, most important things in the real world change slowly and con- 
tinuously over time. Therefore we should look for filters that produce a slowly 
varying output, while making sure that they vary at all and not just represent 
background. Because the state of something as complex as a real environment 
must be multidimensional, we should expect a set of filters, which have to be 
independent of each other. This means at least they have to be decorrelated. 

State spaces extracted with such a filter will be temporal coherent, but ignore 
fast changing processes, e.g. the ball in table tennis. 



4.1.1 Optimization problem 

There are several approaches to learn a filter as described above. They share 
some basic claims on the filters output which were formulated best by Wiskott 
et al. [351 [37]. 

• The mean temporal derivation of the output should be as small as possible. 
Normally the derivation is not known, so the discrete temporal derivation, 
i.e. the difference between successive outputs, is used instead. To punish 
large changes more than small ones, all approaches square the derivation 
pointwise. This value is called slowness. 

• To ensure that the output varies at all, the variance of the output is 
normalized, either by constraint or by multiplying the cost function with 
the inverse variance. This claim is equivalent to a zero mean and unit 
variance constraint. 

• The different filters should be independent of each other. Since statistical 
independence is hard to achieve, all sighted works restrict themself to 
claim decorrelation between the filters. 
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Definition 4.1 (SFA Problem [35] ^ 

Given a temporally ordered sequence of observations x^\ . . . , x^ £ X, we 
want to find a set of k mappings <pi{x) : X — > IR that satisfy: 

min s(<pi) := ~E,[4>i(x) 2 ] (slowness) 

s.t. ¥\4>i{x)\ = (zero mean) 

K[<pi(x) 2 ] = 1 (unit variance) 

Vj 7^ i : ¥i[<f>i(x)<pj(x)] = (decorrelation) 

where <f> denotes the temporal derivative and s{<j>i) the slowness of mapping fa. 
The indices i are sorted in ascending order according to their slowness, so the 
first mapping is the slowest. 

Grouping the outputs and its discrete derivatives into matrices = fa(x^>) 
and $it = 0;(:E ( ' t+1 " ) ) — fa(x^) we can express problem 14 . 1 1 more compact: 

min s(<&) = tr(<& T <I?) 
s.t. $1 = (4.1) 
i** T = I 

Other approaches Beside Wiskott et al. [35], we found two other optimiza- 
tion problems that follow the principle of temporal coherence: 

• Bray and Martinez |36j formulated as a related cost function the ratio of 
long- and short-term variance V and S: 

maxF(*) = - = -g^| (4.2) 

where fax) and (j>(x) are the centered long- and short-term means, e.g. 
fa(x^) = "Y^jtlt-m ri anc ^ an e cjuivalently defined <f> with smaller m. 
Minimizing the short term change (derivation) while maximizing the long 
term change (variance) follows the same principle as SFA. Decorrelation, 
however, is not required by this optimization problem and only appears 
in the results of [36] . because the problem is solved with an eigenvalue 
decomposition, for which the solution is decorrelated by default. 

• Einhauser et al. [40] used a cost function which is based on the same 
constraints but was formulated diffcrentljQ 

ma::m) _ j- mm 1 YT c[MxUj{x)? (43) 

1 1 ^ v[fa(x)] (k-iy^£;v[Mx)WM*)] ' 

ilti 

The first term on the right side incorporates the minimization of the slow- 
ness with the unit variance constraint, the second term represents the 
decorrelation constraint. 



1 The presented formalism is no direct citation but a readable version. 
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Wyss et al. [JT] used a weighted version of this cost function with an 
additional zero mean constraint embedded the same way. 

4.1.2 Optimal responses 

Consider we could solve problem 14. ll in the class of continuous functions. Since 
the form of the optimal function depends on the training sequence at hand, we 
are interested in the form of the output, which should always be the same. 

Definition 4.2 (Optimal responses) Let be an unsupervised optimization 
problem based on samples from set X. Let further f be the optimal function of 
based on any infinite training set I C 1 Then {f(x)\x G T} is called the 
optimal responses of . 

In our case, wc look for the function that describes the slowest output that still 
fulfills the constraints of problem 14.11 One can use variational calculus to find 
a solution independent of a specific input. [37]. 

Formally we assume some stochastic process that generates the training set. 
The process consists of a state in a bounded and continuous state space S and 
some generating parameters, e.g. a position s £ [0,1] C R plus a Gaussian 
random variable: s( t+1 ) = s^' +N(0, a), s would be the state and a the velocity 
of change. A function v : S — > X generates observations x € X out of the true 
states. 

There have to be some rules that constraint the stochastic process to S. 
The optimal responses will depend on what happens if a boundary is reached, 
i.e. if the state is stopped (reflected) or continues at the opposite boundary. 
These' rules are incorporated into the variational calculus approach by boundary 
conditions. 

Proposition 4.1 A set of functions which applied on a training sequence fulfills 
zero mean, unit variance, decorrelation and has orthogonal time derivatives, 
minimizes problem ^. 1\ within the space spanned by these functions. 

Proof: See [37]. 

This proposition essentially gives us the means to validate a proposition 
about optimal responses. We start with the simple case of a compact subset 
of the one dimensional Euclidean space. Without loss of generality, assume 
§ = [0,1]. 
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Proposition 4.2 (Cyclic boundary condition) Assuming a cyclic bound- 
ary condition, i.e. v(0) = the optimal responses are Vx £ [0, l],Vi S IN + : 



Proof: Proposition 14. II holds. See [37] for details. 

Any cyclic state, e.g. orientation or electrical phase, is subject to this con- 
dition. Note the similarity to trigonometric basis functions (cq. I3.9|) . i.e. with 
the exception of the constant rpo, all basis functions are present. The domain, 
however, is only [0, 1] instead of [—1, 1], reflecting the cyclic boundary constraint. 

Proposition 4.3 (Free boundary condition) Assuming no boundary con- 
dition, the optimal responses are Vx G [0, l],Vi £ IN + : 



Proof: Proposition 14.11 holds . See [37] for details. 

Any state between two limits, e.g. coordinates restricted by walls, is subject 
to the free boundary condition. In fact, this is another expression of the trigono- 
metric polynomial on the interval [0, 1] instead of [—1, 1]. Since the even trig, 
basis functions can approximate any even continuous function in the intervale 
[— 1, 1], they can approximate any continuous function in [0, 1] [37] . 

However, many realistic scenarios require multidimensional state spaces. For 
example, a robot who can move freely on a plane and look in any direction, 
requires a 3-dimensional space. Franzius et al. |42j derived a solution for this 
case. They considered a stochastic generation process where an agent (e.g. a 
robot) can move in the direction it faces and rotate to change this direction. 
This implies a state space § = [0, l] 3 , where the first two directions represent 
two spatial degrees of freedom and the last the orientation, all scaled to the 
interval [0, 1]. 

Proposition 4.4 (3d state space) Let S = [0, l] 3 be a state space with free 
boundary conditions in the first two, and a cyclic boundary condition in the last 
dimension. Under the condition of decorrelated movements in the training set, 
the optimal responses are Vx, y,9 € [0, 1] : V(i, j, I) £ IN 3 \ {(0, 0, 0)} : 



Proof: Proposition 14.11 holds . See [12] for details. 

The solution resembles the trigonometric representation (def. I3.12|) . but uses 
the basis functions of the one dimensional boundary cases instead of ip. 

However, these analytical solutions are exceptions. Experiments show that 
non-rectangular state spaces of the same dimensionality develop different opti- 
mal responses (sec. I5.2.3|) . 




(4.4) 




(4.5) 




(4.6) 
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Proposition 4.5 (Mixture responses) If <fii and <f>j are two optimal func- 
tions of problem \4- 1\ with equal slowness, then any linear combination a<pi + b(f>j 
has the same slowness, as long as a 2 + b 2 = 1. 
Proof: s(a<j)i + b4>j) = E[(a<fo(a;) + b4>j(x)) 2 } 

a 2 E[^(x) 2 } - 2abE[4> l {x)4> J {x)] + b 2 E[<Pj(x) 2 } 
= a s((pi) + b s((pj) = (a +b )s(<«) □ 

Intuitively, if we have two sets of optimal responses with equal slowness, any 
rotation between them is also a set of optimal responses, since it neither violates 
the constraints nor changes the slowness. In the case of one dimensional state 
spaces, only numerical errors can result in mixture responses. Solutions of 
proposition 14.41 on the other hand, are prone to exhibit mixtures. 

Let us consider a mean driving velocity v (influencing the slowness of spatial 
dependent filters) and a mean rotational velocity lo (influencing the slowness of 
orientation dependent filters). Due to the same mean velocity in both spatial 
directions, we have for any a 2 + b 2 = 1: 

Vj, I G IN, V* G IN+ : sffifl) = s(<p*u) - s{p4* ifl + bcj>* a ) (4.7) 

Therefore, we have to expect mixture responses between all filters of equal 
slowness. Note that this does not change the number of decorrelated filters, 
only their output is no longer determined by proposition 14.41 

4.2 Applications 

To the authors knowledge, there are no practical applications based on slow 
feature analysis. There are, however, a couple of scientific works that investigate 
its potential in different contexts. 

4.2.1 Receptive fields 

The first brain area processing visual information from the eye is called visual 
cortex. It contains cells that specialize on specific patterns (e.g. bars of specific 
orientation) in a small part of the perceived image, a so called receptive field. 
The question arises how these cells differentiate and adopt patterns that allow 
an efficient coding of the perceived scene. 

Bcrkes and Wiskott (38] used SFA to learn filters that react on patterns 
similar to those of real cells. They constructed a random process that moved a 
frame over natural pictures by translation, rotation and/or zooming (fig. 14.1b ) 
Input samples consisted of two successive frames to represent a special neuronal 
structure called complex cells (fig. 14.1b ). The results (fig. 14.1b ) resemble phase 
shifted gabor filters, as they are found in these cells. 

The results build a convincing argument, but for the sake of completeness 
one has to add that there are other unsupervised methods that produce similar 
filters. For example, Lindgren and Hyvarinen used quadratic ICA and also 
produced comparable results, though in a smaller resolution |43j . 
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a) Example training sequence b) Da|a format c) | nptlts with highes , (+) arx| | owes , ( _, fesponses 

Figure 4.1: Experimental setup and some results of Berkes and Wiskott [38] . 



4.2.2 Pattern recognition 

Naturally SFA predicts continuous states. While discrete states (e.g. classes of 
patterns) present no problem in principle, the absence of a temporal structure 
is. 

Berkes constructed a classificator for handwritten digits from the MNIST 
database based on SFA [39] . Instead of a temporal derivative, two training 
digits of the same class were subtracted, thus minimizing the distance between 
alike patterns. For the final classification, Gaussian distributions were fitted to 
the digit classes, each representing the probability of membership. 

With 1.5% error rate on a test set of 10,000 samples, the specified classifier 
performed comparable to established algorithms in this field, e.g. the LeNet-5 
algorithm misclassified 0.95% of the test set in comparison to 12% by a linear 
classifier. 

4.2.3 Place cells 

Rodents are known to develop specialized place cells in hippocampal areas after 
familiarizing them self with their surrounding. These cells are only active in one 
part of the room, independent to the heads orientation (fig. 14.6b ). Together 
with head- direction cells, for which the output depends only on the orientation, 
independent of the rodents position, they are believed to play a mayor role in 
the rodents sense of navigation [42] . 

Franzius et al. [42] [44] developed a system that exhibits place and head- 
direction cell like characteristics by applying SFA to the video of a virtual rats 
random walk. First, the random walk of a virtual rat in a rectangular room 



Figure 4.2: Some handwritten digits from the MNIST database used in [39] . 
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(fig. 14.3b ) was recorded with a virtual wide angle camera mounted on the head 
(fig. 14.3b ). The single frames were segmented in overlapping patches (nodes in 
fig. 14.3b ) in line with the concept of receptive fields. SFA was performed on 
all patches, then the output was rearranged to overlapping receptive fields of 
the next layer, and so forth. This way the multi-layer network was able to cope 
with the high dimensional video input, despite the fact that linear SFA (sec. 
I4.3.ip with a quadratic expansion (sec. 14.3 . 2[) of the patches was used for layer 
wise training (fig. 14.3t h). 

By using a suitable (i.e. powerful enough) function class in the SFA train- 
ing, the output of the last layer has to resemble the theoretical predictions of 
proposition 14.41 Since the mean rotational and translational velocity of the rat 
influences the slowness of the optimal responses s(</>^;) and therefore their or- 
dering, it is possible to extract orientation or location invariant features by 
controlling the random walk. Slow movements with fast rotations lead to ori- 
entation invariance in the first SFA filters (fig. I4.4|) . Slow rotations and fast 
movements, on the other hand, will produce position invariance. 

If we want to derive place cells, higher SFA filters are useless. The first 
orientation dependent filter </>q 01 will mix with all orientation invariant filters of 
equal slowness, so the mixed output will not be invariant to orientation anymore. 
Therefore, the virtual rat moved slow and rotated fast to obtain the results in 
figure PI 

The last step of Franzius et al. (fig. 14.3b ) was the application of independent 
component analysis (ICA, [2]) on the last layers output. The results (fig. 14.6b ) 
resemble the measured place cells in rats (fig. 14.6a ). 
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Figure 4.4: SFA responses from Franzius et al. [42]. Every pixel in a sub- image 
represent one position in a square room. The colors indicate SFA responses. 
Image source: Poster presentation by M. Franzius at the 2007 meeting of the 
German neuro-science society. 



Another approach to a similar problem, that aimed to understand the di- 
versification in the visual system, has been made by Wyss et al. |41j . They 
designed an online algorithm based on equation 14.31 with a nonlinear function 
class. 

As for Franzius et al., the key element was a hierarchical processing of over- 
lapping receptive fields with the same learning principle in every layer. The 
video images were of smaller resolution and they used 5 instead of 3 layer. The 
main differences resides in the use of a real robot and another magnitude of 
training examples: 66h at 25 Hz « 6 • 10 6 frames [JT] in comparison to "a few 
laps within 5,000 samples" [4"2"] . 

The results of the last layer arc orientation invariant and look like (big) 
place cells (fig. 14. 5|) . but since no theoretical solutions are available, it is hard 
to comprehend why. In the end the results of Wyss et al. set up more questions 
than they answer, so this thesis sticks with the methodology of Franzius et al. 



Level 1 Level 2 Level 3 Level 4 Level 5 




Figure 4.5: Two examples for every layer of Wyss et al. [JT]. The left of each 
example is the mean response at every position and the right represents the 
orientation stability in all directions within one standard deviation (gray). 
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a) Measured Rat Place Cells 



b) Artificial SFA/ICA Place Cells 



Figure 4.6: Simultaneous recorded place cells (a) in comparison with results 
from Franzius et al. [42] (b). All sub- images represent cell activity levels at 
different places with a ceiling camera view on a square/rectangular room at 
which the rat familiarized itself by a random walk. Image source: |44j 



4.3 Algorithms 



In this thesis we want to use optimization problem 14.11 i.e. the formalism 
developed by Wiskott et al. First we will review linear SEA and its common 
extension, to finally define a kernclizcd SFA algorithm. The latter is novel in 
this form, but has similarities to existing works. 



4.3.1 Linear SFA 

We aim to find a set of k functions that solve minimization problem 14.11 with 
respect to n observations {x^}f =1 C TR d . Despite the name, we consider affine 
functions <fii(x) = wjx — ci as model class. To keep the notation simple, we 
use a matrix notation of filter responses (<& and <&, as defined before) and 
observations X it = xf' and X it = a^* — xf' 

The k solutions are calculated simultaneously in 3 steps [35] : 

1. Centering of the data: where x = E[jc], 

2. Sphering of the centered data: x s = Sx c , where E[a; s a;J] = I. Sphering 
establishes unit variance and decorr elation. An eigenvalue decomposition 
of the covariance matrix E[a; c a;J] = ^ XX T — xx T = UAU T , can be 
used to determine the sphering matrix S = A~ 



xx 

-1/2 V T 



Minimizing the slowness: This can only be achieved by a rotation of the 
sphered data, since any other operation would violate the already fulfilled 
constraints. Again, an eigenvalue decomposition of a covariance matrix 
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Algorithm 9 Linear SFA 



Require: fc e M + ; {x^}f =1 C M d 

1 1 Collect covariance matrices 
X\ = 

Ci = xW X W T 
Ci = zeros(d, d) 
for i = {2, . . . , n} do 

Xi— \ -\- x^ ' 
Q = Ci_i +a;( l )a;W T 
6 4 = 6i_i + iWa!W T 
end for 



/ / Calculate sphering matrix S 
UAU T = eig(i C„ - ^x n xl) 
(U, A) = remove_zero_eigenvalues(U, A) 
S = A-'/ 2 U T 

/ / Find fc slowest direction in sphered data 
UAU T = eig( H i T SC„S T ) 

(Ufc,Afc) = remove_all_but_lowest_k_eigenvalucs(U, A, fc) 

// Return afhne function (f>(x) = W T a; — c 
w = S T U fc 
c = iW T i„ 
return (W, c) 



E[i; s i;7] = ^j-j-SXX T S T = UAU T can be used. The solutions are given 
by the eigenvectors corresponding to the fc smallest eigenvalues. 

Combining the steps, the affine function <j> is given by <fi(x) = W T a; — 
c, where W = XJA^^-^XJj and c = W T S. Since the eigenvalue decom- 
positions have complexity 0(d 3 ), algorithm [9] has an overall complexity of 
max(0(d 2 n),0(d 3 )). 

4.3.2 Expanded SFA 

When linear SFA does not yield sufficient slowness, i.e. linear functions are too 
weak, we might wish to use a nonlinear function class. A common way is to 
expand the data into a nonlinear feature space and perform linear SFA on the 
expanded data [33] . 

The most popular expansion is the set of all monomials up to degree 2 
(quadratic expansion) 1381 142| or degree 3 [39]. Linear SFA with these expansions 
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is sometimes referred to as quadratic SFA and cubic SFA, respectively. Since the 
dimensionality of expanded inputs grows rapidly, most authors reduced it first 
with a principal component analysis (PC A QT]) [38] [39] [42] . Because monomials 
of extreme positions can also take extreme values, Franzius et al. included a 
clipping of high outputs in their procedure |42j . 



4.3.3 Kernel SFA 

The idea of projection into a high dimensional feature space reminds of kernel 
techniques, so we might avoid explicit expansions and develop a kernelized SFA 
algorithm, instead. For a related cost function a kernel approach has already 
been made by Bray and Matrinez [36] . Besides the difference in the cost function 
their approach differs from this one because they do not use a kernel matrix 
but work directly with support vectors. Using the projected process kernel 
matrix approximation method, their approach turns out to be a special case 
of this algorithm. In the following let k(-,-) be a given kernel and ip(-) the 
corresponding feature mapping. For a definition of these terms, read section [2~2l 
Again, we group the data projected in feature space in a matrix — ipi(x )■ 
In line with linear SFA, the kernel SFA algorithm consists of three steps: 

1. Centering can be performed directly on the kernel matrix (sec. I2.2.4|) : 
K c = (I — ill T )K(I - ^H T ). Note that we implicitly centered the data 
in feature space this way, since K c = *&J*& C . 

2. Sphering: As discussed in section [2. 2. 4[ we substitute the nonsingular 
eigenvectors U in the sphering matrix (in feature space) with those of the 

kernel matrix: S = -^A^Vj^ . 

v n 

3. Minimizing the slowness: Let * := \&D, with D e IR nx "~ 1 being a 
matrix which is zero everywhere except for ; = —1 and Dj+i^ = 1. 
This way we can express the covariance matrix ^-!-j-S\E'\E' T S T in feature 
space as n ^_i) A~ 1 V T KK T VA~ 1 with K = * Tl 3/D. The eigenvectors 
corresponding to the smallest k eigenvalues of this covariance matrix are 
the rotations U& which optimize the problem. 

The kernel solution to optimization problem 14. H is given by <p(x) = A T k(x) — 
c, with the weight matrix A = (I — -!ll T ) VrA" 1 !)^, the kernel expansion 
fc(aj) := [fc(a;( 1 ),a?),...,fc(a;( n ),a;)] T and the bias c = ^A T K1. 

The algorithm collects the matrix K and derives KK T from it. The two 
eigenvalue decompositions in steps 2 and 3 are responsible for an overall com- 
plexity of 0(n 3 ). 
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Algorithm 10 Kernel SFA with kernel matrix approximation 
Require: k : M d x ^(lR d ) -> IR m ; fc € IN+; i C {x^j^i C K d 

/ / Collect kernel matrices 

// In the following: feM := fc(ajW, {x^}^) 

ki = fcW 

KKj = k^k^ T 

KK| = zcros(m, m) 
for i = {2, . . . , n} do 

fci = fc,_i + feW 

KK^ = KK i L 1 + fc«fc« T 

= KK7_! + fc«fc« T 

end for 



/ / Calculate sphering matrix S 
VAV T = cig((I - Ill T )KK,T(I - IllT)) 
(V r ,A r ) = remove_zero_eigenvalues(V, A) 
S = A7 1/2 Uj 

II Find k slowest direction in sphered data 
UAUj = e ig(^sMjS T ) 

(Ufc,Afe) = rcmovc_all_but_lowest_k_eigenvalues(U, A, k) 

If Return kernelized function <f>(x) = A T k(x, {x^ l '}^L x ) — c 
A = S T U_ fc 

C — k'n 

return (A, c) 



Kernel matrix approximation Kernel SFA performs much better than lin- 
ear SFA, but raises the complexity to 0(n 3 ). One would like to apply the 
projected process method described in section 12.2.51 but this method only pro- 
vides approximations of K 2 . As already discussed, K 2 = VA 2 V T has the same 
eigenvectors and squared eigenvalues of K. If we perform the eigenvalue decom- 
position on KK T and take the square-root of the eigenvalues, the solution is 
the projected process approximation of K. 

With this approximation, algorithm HU1 has complexity 0(m 2 n) because it 
requires the collection of the matrices KK T and KK T . Additionally, one has to 
pick a set of support vectors, which usually is costly, too. For more information 
on the selection of support vectors, read section [2.2.61 
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4.4 Conclusion 

Slow feature analysis is a unsupervised method to extract the current state 
out of any data with a temporal structure. It learns an array of filters that 
aim to extract all non-static parameters that distinguish the input from other 
samples. Given a powerful function class and enough training samples, the 
output will converge to a trigonometric representation of the current state, as 
far as it can be extracted out of the current input. As such it is well suited 
for the approximation of continuous functions on this state, as the discussion of 
trigonometric representations in section [3.1.31 shows . 

However, instead of converging to the theoretical optimal responses, filters 
with equal slowness can mix linearly. This is not a problem for function ap- 
proximation, since the mixed solutions span the same space of trigonometric 
basis functions. If, on the other hand, one is only interested in a subset of the 
state components, e.g. only the spatial component of the state space described 
in proposition 14. 4| the mixture prevents simple exclusion of unwanted filters. 
Because the number of filters grows exponentially in the number of state com- 
ponents, every unwanted component is a nuisance. If possible, one can obtain at 
least the first few filters undisturbed by controlling the relative velocity between 
state components in the training set. 

Under controlled circumstances (e.g. in a simulator) we are interested in the 
whole state space and can therefore ignore these effects. Real life experiments, 
however, have shown many additional components, e.g. altitude of the sun, 
artificial light, etc. Every independent source of change in the light conditions 
is considered a state component as well as any change in the actual scene. 

At last one has to point out that this method can only be applied to static 
scenes. Something as complex as a group of humans induces a state space far 
too complex for any function class feasible in the near future. 



Chapter 5 

Empirical validation 



5.1 Experiment description 

The main experiment of this thesis is a proof of concept of the following propo- 
sition: 

Proposition 5.1 Policy iteration can learn a control based only on a current 
video image, which is preprocessed with filters obtained by slow feature analysis. 

We showed in chapter [4] that SFA, applied on a random walk in bounded 
state spaces, converges to filters that form a trigonometric representation of this 
state space. This claim holds, independent of the representation in which the 
state is originally presented to SFA. Therefore, the state can be presented in 
form of video images, recorded from the perspective of a robot which is moving 
through a static scene. SFA will converge to a trigonometric representation of 
the robots position. 

As we have shown in chapter [31 the policy iteration method LSPI is able 
to learn a control out of a random walk presented in any representation. Of 
course, the quality of the control depends on representation and considered 
function class. However, we have also shown that trigonometric representations 
are especially suited for the class of linear functions, as employed by LSPI. 

To validate this approach, we chose a navigational experiment in which a 
robot is supposed to drive into a goal area. The only available sensor is a head 
mounted camera. Obviously, its video images are too high dimensional and too 
complej|l to promise LSPI solutions with reasonable quality. 

However, as results in section 15.3.31 show, after preprocessing the learned 
control was able to find the goal in ps 80% of the test trials. 



1 LSPI has a complexity of 0(a 2 d 2 ) in space and 0{a i cP) in time, where a is the number 
of actions and d the dimensionality of the state representation. 

2 For example, a small rotation of the robot leads to drastic changes in nearly every pixel. 
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d) Validate Control 

Figure 5.1: The experiment: a) Record a random walk video with the robot, 
b) Learn the SFA filters based on this video, c) Estimate the Q-value based on 
the preprocessed video, d) Validate the control at random start positions. 



5.1.1 Overview 

The main experiment is conducted in 4 phases, as depicted in figure [5TT1 

a) Record of a random walk video with a camera mounted on the head of a 
robot. 

b) Learn a preprocessing based on the random walk video with kernel SFA. 

c) Estimate the Q-values based on the preprocessed random walk video with 
LSPI, to derive a near optimal policy. 

d) Validate the learned control at random positions by choosing the action 
associated with the highest Q-value until the robot reaches the goal. 

5.1.2 Formal description 

Task The task is to navigate as quick as possible into a goal area. This area is 
not marked or discriminablc to other parts of the environment, besides by the 
reward given when the robot is inside. The only other reward is a punishment for 
getting too close to walls, i.e. most of the environment is without reinforcement 
signal. 

Environment The environment is static, bounded and the robots position 
recognizable almost anywhere based on the camera images. That means we 
have a closed room which is asymmetric at least in the textures and therefore 
has few positions that resemble each other, which is the case in many indoor 
scenarios. It is the claim of staticity that introduces the most problems, since 
it restricts the application to uninhabited areas. It is, however, a restriction of 
the slow feature analysis and therefore necessary for this thesis. 

Behaviour We assume decisions at discrete time steps between a limited num- 
ber of discrete actions. As a side effect, the discretization liberates us of any 
time constraint. The learning algorithm does not know the actions in use, but 
they have to allow the robot to navigate anywhere. For our experiments we 
chose 3 actions: move forward ca. 30cm and turn left/right ca. 45 degrees. 




a) Record Random Walk b) Learn Preprocessing c) Learn Policy 
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a) Roto! b) Robot View c) Celling View 



Figure 5.2: The wheeled Pioneer 3DX robot used in the experiments, a picture 
from the head mounted camera and from the observing camera at the ceiling. 



5.1.3 Robot 

For simplicity, the robot is assumed to be wheeled and is able to execute com- 
mands like "turn 45 degrees" or "move 30cm". However, the execution can 
be flawed, e.g. the robot can turn 48 degrees instead of 45 or move 32cm in- 
stead of 30. The NeuRoBot project used the Pioneer 3DX robot build by 

MOBILEROBOTS INC (fig. 15.2b ). 

Camera The head mounted camera has a wide field of view to impede similar 
images at different positions. Beside this, the only restrictions are a reasonable 
resolution and image quality. The Bumblebee® camera used by NeuRoBot 
has a field of view of 66° , which seems to be sufficient anywhere but directly in 
front of walls. It also records a pair of stereo images, of which we only considered 
the left one (fig Ob). 

Other sensors For navigation, no other sensors are necessary. The considered 
policy iteration method, however, requires a random walk before training. This 
implies at least one dependable sensor system to avoid/react to walls during this 
phase. The Pioneer 3DX is equipped with ultrasonic sensors and sensitive 
bumpers. To treat the test environment with care, we employed the ultrasonic 
sensors to avoid walls during random walk. 

Validation To validate the learned control, one needs to know the robots 
position in training and testing. The internal position estimator of the Pioneer 
3DX turned out to be useless, since local estimation errors are integrated over 
time. After a few minutes, the error was unacceptable. Instead, we installed a 
web cam at the ceiling and extracted the position of the illuminated blue and 
red ball on the robots head from it (blue and red squares in fig. 15.2b - ). 
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Same time, 5° rotation Same position, 1 min. later 

Figure 5.3: Illumination changes drastic under daytime conditions. 



5.1.4 Environment 

In more detail, the requirements on the environment are: 

• Bounded space: Training SFA requires to travel the whole state space, 
multiple times if possible. An open field (without boundaries) would make 
this impossible, therefore we need a closed room. 

• Asymmetry: If two positions produce the same image, SFA will assign 
both positions the same response. This violates the assumption that the 
learned representation of states is injective (def. 13.71) . 

• Static scene: The robots position is the only variable in the environment. 
If other factors change the image reliably, e.g. whether an operator is in 
the room or not, they will introduce new SFA filters. This would increase 
the number of extracted filters without gaining additional information 
useful for the navigation problem. 

As it turns out, natural light conditions are not static at all. The suns 
position and every cloud shifting in front of it changes illumination and reflection 
in the scene. Due to the build-in brightness correction of the camera, the images 
suffered shifts in color, contrast and brightness almost independent of the robots 
position (fig. I5.3[) . To circumvent this, we restricted our experiments to artificial 
illumination at night. Reflection and brightness still change, but remain almost 
static with respect to the robots position. 

The experimental environment used in this thesis consists of a rectangular 
3m x 1.8m area, which is bounded by tilted tables and a wall. All four sides 
were covered by the randomly assembled wallpaper in figure [5T41 The walls were 
ca. 90cm in height, so the robot could see parts of the laboratory and therefore 
the human operators. To avoid human interference, the upper part of the image 
was clipped out (as depicted in fig. [5] 
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Figure 5.4: Randomly assembled wallpaper of the experimental environment. 
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Camera Simulator Camera Simulator 

Figure 5.5: Simulated environment with rendered images of 3 example positions 
and their real-world counterparts recorded by the Pioneer robot. 



5.1.5 Simulator 

Working with the Pioneer robot, however, proved to be time consuming. In 
addition, producing high resolution test maps (sec. I5.1.6[) is nearly impossible. 

The author implemented a simulated version of the experiment to verify the 
results in large scale tests. The simulator is written in JAVA3D and communi- 
cates with the Matlab implementation of algorithm [8] by providing a network 
service. The render engine does not provide shadows or directed illumination 
(only ambient light). Instead it relies on photographed textures of walls and 
laboratory. 

Figure 15.51 shows an overview of the rendered environment from a ceiling 
camera perspective. It also gives three example images and their real-life coun- 
terparts, recorded by the robot. The first difference one will notice is the differ- 
ence in color. The textures were photographed at daytime, the camera images 
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recorded at night. Artificial light has a shift to yellow compared to sun light 
reflected on white walls. The second inaccuracy is the unrealistic floor texture. 
The whole floor as texture simply did not fit into the graphic cards memory 
Additionally, the real floor provides reflections that could not be modelled prop- 
erly. At last, the Bumblebee camera lens distorts the outer areas of the image, 
which also could not be simulated. 

The simulator is not meant to copy reality perfectly, but to allow a thorough 
verification under similar conditions. 

Other environments Until now we remained inside the boundaries of theory. 
We know that slow feature analysis converges to a trigonometric representation 
in rectangular rooms. The only remaining question is how well (which will be 
answered in section I5T2]) . Predictions for other room geometries, however, are 
hard (and maybe impossible) to derive, so the question arises how SFA will 
handle such a case. With the simulator at hand, we will examine this question 
for an example room. 

Many room shapes are possible and interesting. We chose two quadratic 
rooms, connected by a small corridor, which we will call two-room environment 
(fig. 15. 6[) . Since we do not have to rebuild any existing room, we are free to 
use publicly accessible libraries of textures, from which we chose an Egyptian 
theme. To make the scene more appealing the author decided to place one room 
outdoors, restricted by a wall of medium height. To prevent similar images at 
different positions, all "inside" walls differ in portrayed reliefs and "outside" a 
couple of different sized pyramids were placed arbitrary in the far distance. 

5.1.6 Video generation 

Both policy iteration and slow feature analysis need random walks as training 
sets. Both methods require balanced sampling over the whole state space. 

As simple as a random walk is to implement, the robot can get caught be- 
tween boundaries as trivial as rectangular corners. Special movement statistics 
(as required for place cells, sec. I4.2.3j) and coarse actions (as the proposed 
45° rotation) introduce additional jamming opportunities. Instead of facing 
these difficulties we decided to record one video (in three sessions) with move- 
ment statistics that seem to visit all parts of the environment equally often. Out 
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of this video we sampled one training set for SFA and one for policy iteration. 
Since we recorded the random walk with the ceiling camera, we were able to 
duplicate it in the simulator. 

The more complex the environments geometry gets, the more likely is a 
random walk to get stuck. For example, in the two room environment the 
connecting corridor is a bottleneck. It is much more likely to stay in the room 
than to hit the corridor and change it. In the long run this effect will cancel out 
statistically, but with limited training samples it is unlikely to obtain a balanced 
set by random walk. 

Since we only performed this experiment with the simulator, we sampled 
the training set as pairs of successive frames at random start positions and 
movements. This may sound unrealistic but is an easy way to achieve guaranteed 
uniform distributed training samples. 

Test sets Sampling a test set (unseen in training) out of the random walk is 
not complicated. However, sampling a meaningful test set is. 

To compare the orientation/place independence of single filters one optimally 
expects a high resolution training set of images at equidistant coordinates with 
the same orientation. A set of these test maps with different orientations, can 
give an overview of the filter responses. 

The simulator can produce exact test images, so we created a set of 8 test 
maps (in 45° steps) with 64 x 32 equidistant coordinates, each (e.g. in figure 
15.11b ). Creating test maps for the real experiment, however, is not that trivial. 

To find a test set which comes near to the described test map, we filtered 
all frames within 5° of the desired target map orientation. Out of this set, we 
used frames as equidistant distributed as possible. As a result, we can only give 
a scatter plot of the test map in which the test samples can differ in angle up 
to 10° (e.g. in figure ETQa) . 

Image resolution Kernel SFA with projected process matrix approximation 
(algorithm I10[) has a complexity of 0(m 2 ) in space and 0(m 2 n) in time. Here 
m is the number of support vectors and n the number of training samples. This 
complexity does not seem to be affected by the input images dimensionality d. 

However, We have to store the support vectors and perform a kernel expan- 
sion at every frame, too. This induces an additional complexity of 0(md) in 
space and 0(mdn) in time (assuming a kernel with complexity 0(d) in time, 




Figure 5.7: Example video sequence before and after preparation. 
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Figure 5.8: Video image preparation steps. 



e.g. the RBF kernel). As long as d < to, this cost does not exceed the cost of 
performing kernel SFA and we are relatively free to choose the image resolution. 

If we exceed this threshold significantly, however, the new costs depend linear 
on the number of pixels, which is quadratic in image resolution. For example a 
320 x 240 RGB image contains d = 230400 independent values which exceeds 
the maximum number of support vectors used in this thesis (to < 8000) by a 
factor of 28. Thcrfore, we scaled the image down to a relatively conservative 
resolution of 32 x 16 RGB pixels (d = 1536). To investigate the influence of 
image resolution, we also duplicated one experiment with a resolution twice as 
high (64 x 32 RGB pixels d = 6144, see section G£2J|. 

Keeping considerations from section 15.1.41 in mind, we performed a three 
step transformation on the original video images before presenting them to SFA 
(see fig. 15.81 for the process and fig. 15.71 for the results): 

1. Clip out the upper 80 lines to avoid human interference from the labora- 
tory. 

2. Correct mean brightness to counter the cameras automatic adjustment. 

3. Shrink the image to size 32 x 16. 

5.1.7 Alternative approaches 

Most reinforcement learning approaches have customized representations (sec 
for example Stone and Sutton [25] ). These are not comparable to an adaptive 
technique like SFA. To the authors knowledge, no other approach can estimate 
the current position based on a single frame. 

The technique coming closest to position estimation based on video images 
is called simultaneous localization and mapping (SLAM or Visual SLAM in the 
context of video input). Originally, Smith et al. |45j investigated the problem 
for vehicles with sonar sensors. Roughly a decade ago, Davison adapted 
the method to Visual SLAM, which is currently under development by several 
groups 03|521I53|. 



5. 1 . EXPERIMENT DESCRIPTION 



55 



The classical solution, as well as many modern approaches [IS] [S3] employ 
extended kalman filters (EKF) to estimate landmark positions. These suffer 
0{n 2 ) complexity in the number of landmarks n, due to statistic dependencies 
between their estimation. Landmarks are small 2d patches with their estimated 
3d position, which should be as relocatable as possible. For this, a variety of 2d 
transforms arc in use, e.g. affinc transformation [S3J or SIFT [55]. Comparing 
all possible patches with all known landmarks is computationally expensive [52] . 
Instead, it is possible to incorporate information about the expected location 
in so called active measurement strategics. This way, Visual SLAM can be 
performed in real time [47] [49] [S3] . 

To cover larger environments, the Rao-Blackwellized particle filter was first 
introduced by Murphy [SO] followed by a practical framework by Montemerlo 
[51] . Out of the perspective of every particle, the camera position is fixed and 
the landmark estimation thus independent, yielding 0(mn) complexity for m 
particles. In the field of Visual SLAM, Sim et al. [52] and Eade et al. [53] 
successful applied these particle filters. 

Alternatively, Estrada et al. [IB] extended the EKF approach by producing 
independent local maps of limited size. These are stitched together in a global 
layer thus the algorithm is called hierarchical maps. Recently, Clemcnte et al. 
[49] applied this approach to Visual SLAM. 

Whereas Visual SLAM is a well established field with remarkable success, 
our approach differs in some mentionable ways: 

• SLAM depends on a specific type of sensor input, whereas SFA automat- 
ically adapts to any sensor. 

• SLAM holds a current state, which is crucial for localization and prone 
to initialization errors. In contrast, SFA works instantaneous and can 
localize the robot using one frame only. 

• SFA extracts a trigonometric representation of the position. SLAM, on 
the other hand, estimates position in 3d coordinates. 

To compare SFA to SLAM, one would have to expand the position estimates to 
a trigonometric representation. These could be used instead of the SFA output 
to learn a control. Because SLAM algorithms are quite complex and not easy 
to use, the author could not provide such a comparison in the available time. 

Anyway, even if we could, SLAM is bound to have an internal state. Com- 
parison would only be fair when we grant such a state to SFA, too. This would 
violate the proof of concept approach of this thesis. 
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5.2 Learn preprocessing 

Before we learn the navigational control in the next section, we want to validate 
the theoretical predictions of SFA as a preprocessing (chapter 0]). The main 
concern is, how well kernel SFA performs in our context, i.e. how close to the 
optimal responses one can come with current computers. 

We evaluate different SFA algorithms, which in fact reflect different function 
classes fitted to optimize the SFA cost function (sec. 14.1. The point of 
reference are always the optimal responses that can be expect in the limit (sec. 
14X21) . 

The videos We recorded 3 videos in different environments to test SFA pre- 
dictions ( sec. 15X41 15X51 fc 15X61) : 

• Robot experiment: The original random walk video recorded by the Pi- 
oneer robot. We reduced the frame rate to ~ 1Hz, resulting in 35128 
frames under relative stable conditions at night. 

• Simulated experiment: Due to the ceiling camera, we were able to record 
the same trajectory as in the robot experiment in the simulator. 

• Two room experiment: To test the SFA responses in different geometries, 
we recorded a simulated training set in the two room environment. The 
video contains 35000 random transitions at random start positions. 

The algorithms Wc applied linear SFA (algorithm [9} directly on the raw 
video images. This algorithm restricts the image resolution stronger than kernel 
SFA. Using 32 x 16 RGB pixel images turned out to be feasible without problem, 
but 64 x 32 RGB pixel images rise the computational demand by a factor of 64. 
We therefore had to omit the latter experiment with linear SFA. 

Our main focus centered on kernel SFA falgorithm [T0|) . Because all videos 
contain ~ 35000 frames, wc employed the projected process kernel matrix 
approximation method with a greedy support vector selection (algorithm |4]). 
Through the number of support vectors, one indirectly controls the model com- 
plexity of this approach. Therefore it is especially suited to investigate the SFA 
behaviour for different function classes. While the complexity of kernel SFA is 
susceptible to large image resolutions too, the 64 x 32 pixel resolution did not 
cause overwhelming overhead, as for linear SFA. 

Wc only considered the RBF kernel (sec. I2.2.3[) . Other kernels seemed 
promising too, but RBF kernels proved to be very reliable as long as the support 
vectors are drawn out of the complete state space. Because the greedy selection 
algorithm allows only a hyper parameter v to adjust the sparsity of the support 
vector set (sec. I2.2."6|) . finding a suitable sized set demands multiple trials with 
either changed v or kernel parameter a. Because we have no other means to 
determine a good a, we decided to vary it and keep the sparsity parameter fixed 
at v = 0.1. 
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Filter 1-10 (mean) Filter 1 1-20 (mean) Filter 21-30 (mean) 




698 21083248 5050 7973 698 21083248 5050 7973 698 21083248 5050 7973 



Number of SV's Number of SV's Number of SV's 



Figure 5.9: Mean slowness on the training set of the robot experiment for differ- 
ent SFA classes and kernel matrix approximations as number of support vectors 
(SV). 32 x 16 and 64 x 32 denote the image resolution. Note that linear SFA does 
not use support vectors and the theoretical solutions are the slowest functions 
possible. 



5.2.1 Slowness 

The foremost question is whether the considered SFA algorithms are able to 
produce a trigonometric representation of the robots position. Since SFA fil- 
ters are likely to be mixtures of optimal responses, a direct comparison is only 
possible in the slowest, unmixed filters. 

Instead, one can compare the slowness of SFA filters and optimal responses 
on the training set. We know that no filter can be slower than the optimal 
responses, so the difference in slowness can serve as a measure how well SFA 
has converged to a trigonometric representation. Because this does not involve 
any test sets, we compared the slowness in the robot experiment, as it is the 
most realistic case. 

In figure [5~I0l the slowness of all filters obtained by kernel SFA with different 
sets of support vectors are plotted. One can observe that the first filters are much 
closer to the optimum than latter ones, which are more likely to be influenced 
by (fast) noise. 



698 SV 
1677 SV 
4554 SV 
7704 SV 
optimal 

1000 2000 3000 4000 5000 6000 7000 

SFA Filter 

Figure 5.10: Slowness of all SFA filters in the robot experiment for different 
number of support vectors used by kernel SFA. Note how close the slowest 
filters of the 7704 SV line are to the optimum. 
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Figure 5.11: Comparison of the first 6 filters of (a) the robot experiment with 
7973 SV and (b) the simulated experiment with 8064 SV. 



To see how close the results are to a trigonometric representation, figure [5791 
plots the mean slowness every 10 filters against the number of support vectors 
used. Note that linear SFA is comparable to kernel SFA with ~ 700 SV and 
that the slowness seems to converge against the optimum with more SV. 

Another noteworthy result depicted in figure l5~9l is that a higher image reso- 
lution (64 x 32) performs worse at equal model complexity. Given that a higher 
resolution represents a more complicated problem, it makes sense that easier 
problems perform better. It can be expected that this relationship will switch 
at a high model complexity, since the high resolution images contain more in- 
formation. Due to current computational limitations, we were not able to verify 
this prediction. 

5.2.2 Responses 

Having established the performance of kernel SFA in terms of slowness on the 
training set, we now want to verify the predictions on test sets. Because of the 
inherent inaccuracy of the robot experiment test set, we will focus mainly on 
the simulators results. 

Figure 15.111 shows test maps of the first 6 filters of both experiments in 
two opposite directions. While the first filter looks more or less the same, the 
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Figure 5.12: SFA filter F\ of the simulated experiment resembles optimal re- 
sponse 4>wo- 
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Figure 5.13: SFA filters 2 and 3 of the simulated experiment with 8064 SV re- 
semble mixtures of optimal responses </>200 and <^>ooi , as predicted in proposition 
PI 



later ones differ significantly. Since the trajectory (and therefore the movement 
statistics) are the same, the only explanation is the difference in images, which 
led to another solution. 

We can also observe that the first filter in both experiments look like the 
optimal response (j) 100 , which is depicted in more detail in figure 15.121 In the 
robot experiment, the second filter also resembles (/>2oo- In contrast, the second 
simulator filter is not even orientation invariant. The latter situation is shown 
in figure 15.131 As theoretical predicted in proposition 14. 5[ SFA filters with the 
same slowness can mix, e.g. F2 « a</>200 + 6</>ooi and F3 « 6</>2oo — a</>ooi as long 
as a 2 + b 2 = 1 . We see here a mixture of the second orientation invariant basis 
function in horizontal direction fooo (filter 2 in the robot experiment) with the 
first place invariant basis function i/>ooi- 

Filters above 3 are hardly similar to any optimal response, but as filter 2 
and 3 showed, they might as well be arbitrary rotations in the subspace of basis 
function with equal slowness. 

5.2.3 Other environments 

As close as the previous results came to the predicted responses, it is of great 
practical interest how far they will differ when we leave the domain of theo- 
retically predictable environments. The responses of the two-room experiment 
should give us some insight at a simple yet common example. 

Figure [5 . 141 shows the response of first 6 filters from a birds perspective. In 
difference to the rectangular room, the filters seem to specialize in single rooms. 
Interestingly, in the chosen room the response look like a trigonometric basis 
function. For example, the filters 2 and 6 resemble the optimal response <^ioo 
in one room, but are near zero (green color) in the other. In this sense, filters 3 
and 5 resemble optimal response 4>ow- Higher filters are less intuitive, since as 
before they mix strongly with orientation sensitive responses of equal slowness. 
However, in the limit, one would expect two full sets of trigonometric basis 
functions, that each span one room exclusively. 
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a) Right Room b) Left Room c) Unregular 

*r^i ^p^n >r^i 

Figure 5.14: Mean response of the first 6 filters of the two-room experiment with 
6800 SV. The specific room shape favors filters that are exclusive for one room 
(a, b) but some do not fit this scheme (c). 




As promising as these results look in terms of function approximation, un- 
expected filters also emerge. For example, filter 1 resembles a room ID function 
which is positive (red) in the left and negative (blue) in the right room. While 
this somehow still fits in the room specific framework described above, filter 4 
seems to be active in the center of both rooms. 



5.2.4 Discussion 

We were able to verify that, with growing sets of support vectors, kernel SFA 
filters converge to trigonometric basis functions of the robots position. At some 
examples, we could also find evidence for other theoretical predictions, like mix- 
ture of basis functions with equal slowness. Leaving the area of predicted room 
shapes, the results also look promising, albeit not all filters resemble trigono- 
metric basis functions. 

In conclusion, SFA indicates to be a well suited preprocessing for linear 
value approximation. The inherent property of mixtures in basis functions, on 
the other hand, forces us to extract the complete state space. These mixtures 
seem to follow a broad interpretation of "equal slowness", i.e. all features but 
the first few mix with at least one other. Without any means to un-mix the 
basis functions, we can not exclude unwanted states (like humans, sun position 
or blinking lights), and are therfore restricted to static scenes. 

The adaption to a two room environment apparently led to a functional basis 
for each room, and some unique additional filters. A partition in separate rooms 
can actually be seen as an advantage. For example, it reduces the number of 
involved filters when the behaviour is learned in one room only. This could be 
exploited by a clever reinforcement algorithm to reduce the computation time. 

Some filters do not fit in this scheme and seem to be useless in terms of func- 
tion approximation. However, without analytical solution it is hard to validate 
such statements. 
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5.3 Learn navigation 

First, we want to validate the performance of policy iteration with a trigono- 
metric representation (sec. 15 . 372]) . To compare policies, we also define a quality 
measure named convergence quality and show how well the learned control be- 
haves compared to a near optimal hand-crafted reference policy. 

In section 15.3.31 we will learn navigational control based on video images 
only. Here we validate the advantage an SFA preprocessing yields and how 
alternative room geometries influence the performance. 

The algorithms Given a training set in any linear representation (dcf. 13. 7|) . 
we use LSPI (algorithm O to learn the optimal policy. LSPI employs LSQ 
(algorithm [B]), which in turn calls LSTD (algorithm [5]), to estimate Q-values. 
We chose a discount factor of 7 = 0.9 in the laboratory and 7 = 0.95 in the two 
room environment because the latter was of much bigger size. 

After determining the best policy, the control (algorithm [8]) navigates by 
letting the policy choose one of the 3 actions: move forward ca. 30cm, rotate 
left 45° or rotate right 45° . 

The representation, the reward distribution, and the training set influence 
the resulting policy. 

Robot training set We were able to extract a set of 3091 rotations and 5474 
movements out of the random walk video recorded by the robot. After mirroring 
the rotations (if X2 is 45° to the right from x\, then x\ is 45° to the left of X2), 
we reached a training set of 11656 transitions. 

A reward of +1 was given for entering the goal area, and a punishment of 
-1 for coming too close to the walls (< 50cm). We tested two goals, the center 
of the room with a radius of 20cm (fig. 15.15a ) and the lower right corner with 
a radius of 45cm (fig. 15.15b ). 




a) Goal in the Center b) Goal in the lower right Corner 



Figure 5.15: The test set (dots) for the robot- and simulation- experiment, colored 
in received reward (blue: -1, green: 0, red: +1), for two different goals. 
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Figure 5.16: The test set (dots) for the two-room experiment, colored in received 
reward (blue: -1, green: 0, red: +1). 



Two room training set As discussed in section [5.1. 6[ we sampled the ran- 
dom transitions uniformly. Because the environment is larger, we decided to 
use a training set of 50,000 transitions. Figure [5.161 shows the distribution and 
received reward of the used training set. 

It took the author some time to find a working reward distribution. There 
are many hyper parameters to set and some were surprisingly influential. Two 
observations stroke us as particular noteworthy: 

(1) The discount factor 7 has an undeniable impact on the learned control. 
Large environments seem to require large 7. One explanation would be the ex- 
ponential decay of value the more one departs from the goal area. The quadratic 
cost function of LSTD distributes the approximation error equally on all sam- 
ples. As a result, areas of close-by Q- values will be more likely to pick the wrong 
action. This includes all areas far away from the goal where the over-all value 
is small. Therefore, large environments need to "spread" the positive reward 
further, i.e. need a larger 7, to counter this effect. 

(2) When the punished distance to walls is too large, the positive reward is 
not propagated in the other room. Most likely, the close proximity to negative 
reward prevented the distant positive reward to be propagated through the 
corridor. However, since this is no theoretical handicap, it must be a result of 
weak approximation at this critical point. 

5.3.1 Policy evaluation 

Evaluating a learned policy is more complicated than in the case of, for example, 
regression. Applying the policy on a representative test set (e.g. a test map as 
described in section I5.1.6[) shows us the distribution of actions, but does not 
allow any conclusions about the controls behaviour. For example, there could 
exist "loops" from which the robot can not escape and never reach the goal. 
Therefore wc have to evaluate the trajectories, produced by the policy. 

The initial task of this experiment was to reach the goal area as quickly as 
possible. Counting the steps to the goal t(tt,x ) would be the most intuitive 
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measure how well a policy n fulfills this task at some start position x<j. However, 
since we are interested in the quality of all possible trajectories, an area of 
"loops" (i.e. that can not be left) would drive this measure to infinity. 

The approach we want to take here uses the value V^(x n ) to evaluate the 
policies fitness. Given the simplified case that we ignore punishment at walls 
and treat the goal area as absorbing states, this measure is a modification of 
the former: V 7r (xo) = y T y K > Xo )~ 1 . On the one hand, this measure can handle 
infinitely long trajectories. On the other hand, compared to the optimal policy 
7r* , unnecessary detours near the goal have more influence on the measure that 
those far away. When this is not desired, one can normalize the result with 
the optimal value V r (xq). We will call the mean of this ratio the convergence 
quality: 

Definition 5.1 (Convergence quality) Let p(x) be a uniform distribution 
over the state space of x, except the goal area, and let V 7r (x) denote the value 
of state x, with a reward of 1 at the absorbing states of the goal area. Given the 
optimal policy tr* , the following will be called convergence quality: 

C(n)= J ^j^ P (x )dx (5.1) 

This measure of policy convergence has some nice properties: 

• C(ir) £ [0, 1] where 1 indicates that n = n* and that no trajectories hit 
the goal. 

• log 7 (C(7r)) w J (t(tt,xq) — t(7t*, x ))p(x )dx (mean difference in r(-, ■) 
between tt and 7r*), but does not go against infinity in the presence of 
loops. The approximation holds only for high C(ir). 

Evaluating the integral in definition 15.11 is not possible but of course can be 
approximated given example trajectories. 

Reference policy The definition of convergence quality requires the optimal 
policy 7T* that will depend largely on the reward distribution. Building a control 
that maximizes the sum of expected reward by hand is hardly possible. Instead 
one can define a policy that aims to reach the goal as quickly as possible. Though 
this is not it* , there is probably little difference in in terms of t(tt* , ■). 

Given the coordinates of robot and goal, one can define a simple greedy 
policy (algorithm ITT]) . The algorithm does not describe the best possible policy, 
but is probably close to it. 

Two room environment To use the convergence quality measure in the two- 
room experiment, one can enhance algorithm [TTJ by subgoals at each end of the 
corridor. The new policy would first determine which goal is most reasonable 
and then call algorithm [TTJ with the selected (sub) goal. 
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Algorithm 11 Reference policy 

Require: p 6 IR 2 ; 9 6 [0, 2n] / / Position p and orientation 9 of the robot 
Require: g € IR 2 / / Center of the goal area g 
A9 = 8 — direction_of_vector(g — p) 
if |A0| < 45° then 

return(MOVE.FORWARD) 
else 

if A9 < then 

return(TURN_LEFT) 
else 

ret urn ( TURN.RIGHT) 
end if 
end if 



5.3.2 Artificial input 

Before we apply SFA as a preprocessing, we want to evaluate LSPI on the 
predicted optimal responses, i.e. trigonometric polynomials. We used the same 
training set, but instead of the video images we presented the real position 
in a trigonometric representation. Besides the question how well LSPI will 
perform, we are also interested in potential overfitting and other factors that 
would complicate the main experiment. 

SFA sorts its filters with respect to slowness on the training set, so we did 
the same with the optimal responses predicted in proposition 14.41 We omitted 
mixtures of filters (which appear in SFA, as verified in the last section) , because 
LSTDs result is invariant to rotations of the input space. 

Figure 15.171 shows on the left the convergence quality C (jr) for the training 
set (size n = 11656) respectively the first half of it (n = 5838). The goal area 
resides in the center of the room. The evaluated policies were trained with 
state representations of a dimensionality d between 10 and 500, i.e. using the 
d — 1 slowest optimal responses and a constant. However, using huge state 
representations, policies often did not converge at all, so we evaluated every 
experiment 10 times with different initial policies. The colored areas show one 
standard deviation and demonstrate how reliable LSPI works. 

Though one can observe a regime in which the convergence quality depends 
largely on the initial policy, it is unclear why. Apparently, some initial actions 
get in the way of convergence while others lead to good results. This might be 
due to slight overfitting in the initial value estimation which gets amplified by 
policy iteration, but there is no analytical explanation that would back up the 
experimental data. 

A simple way to circumvent this problem is to use all possible actions at 
once, simulating a policy that truly chooses them at random. Superficially, 
this would triple the training set, but due to the state- action representation 
in use (def. 13. 13|) it can be implemented without any computation overhead. 
The results of this approach are plotted as dashed lines in figure 15.171 Since 
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Figure 5.17: Left: Convergence quality C(w) on artificial state representations 
for different training set sizes n. Every data point on the solid line represents 
the mean over 10 trials with different initial random policies, evaluated on 1000 
trajectories each. The colored area is the respective standard deviation. The 
dashed line represents a modified initial policy. Right: log (C(tt)) is an approx- 
imation of the the mean difference in stcps-to-goal between the learned and the 
reference policy. 



no randomness is involved anymore, one experiment per representation size is 
sufficient. 

However, to provide a more intuitive comparison, the right side of figure 
15.171 shows log (C(7r)), which can be considered an approximation for the mean 
number of additional steps the learned policy needs, compared to the reference 
policy. Note that the approximation becomes less reliable for low convergence 
quality values. 

In conclusion, one can observe: 

1. Using a suitable number of optimal responses (small set/blue area: 20-40, 
full set/red area: 20-70). LSPI converges reliably to a policy comparable 
to the reference policy. The mean difference in terms of t(tt, ■) in this 
regime is below one step. We will call this the working regime. 

2. More optimal responses (i.e. SFA filters) do not automatically produce 
better policies. While few input dimensions yield a similar (even though 
not always good) convergence quality, results of higher dimensionality 
become unstable. Note that this effect is very similar in both curves, but 
appears earlier in the smaller training set. This indicates that the effect 
is the result of overfitting. 

3. Sampling all possible actions at once improves the initial policy, doubling 
the working regime in size. However, since the cause of the unstable regime 
is unclear, no theoretical statement guarantees an improvement. 
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Figure 5.18: The same experiments as in figure IS". 1 71 with n = 11656 samples, 
but on simulator video images preprocessed by SFA. One trial per data point, 
which were evaluated at 200 trajectories, each. 



5.3.3 Video input 

Backed up by the knowledge that LSPI works on the theoretically predicted 
output of SFA, we proceed with the analysis of LSPI on preprocessed video 
images. 

Simulated experiment First we wanted to know where the working regime 
with respect the number of used SFA filters is. We therefore repeated the former 
experiments with the simulated experiment preprocessing described in section 
15.21 Utilizing the initial policy modification discussed above, we eliminated the 
need of multiple trials. 

As one can observe in figure [533 onc needs much more SFA filters to reach a 
comparable quality. As a side effect, it seems as if we did not reach the overfitting 
regime, at all. The policies trained with 80 to 500 SFA filters seem to be in the 
working regime. Though the value is low compared to the predictions, the right 
side of figure 15.181 shows that this translates only to 2 to 3 steps more than the 
reference policy. 




Figure 5.19: Test trajectories of the robot experiment. The blue triangles mark 
the start positions and the numbers the length of each trajectory which reached 
the red goal area. 
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Figure 5.20: The same experiments as in figure lSTTTl but with n = 50000 training 
samples in the two-room experiment with artificial state representation. 

Robot experiment Based on this experiment, we chose d = 128 SFA filters 
in the robot experiment. A test trajectory driven by a real robot took the 
author between 5 and 10 minutes and thus generating enough trajectories to 
estimate the convergence quality was not possible. Figure shows all 20 test 
trajectories and the time each took. With the exception of the lower left area 
in the right plot the control seems to work reasonably well. However, the steps 
needed from near by positions vary and most trajectories took surprisingly long. 

5.3.4 Other environments 

Now that we have established our method to be working in simple cases, we 
want to examine the more complex case of the two- room environment. For 
once, the length scale is much larger. For example, crossing the environment 
from left to right, previously done in 9 move-actions, requires 45 actions now. 
Moreover, the non-rectangular shape will lead to non-concave value functions 
with sharp edges at the walls of the corridor. One has to expect that at least 
the latter will put a strain on the value approximation quality. 

Reviewing the results of section I5.2.3[ the diversification of the filters in 
room dependent trigonometric basis functions appear in a new light. Since the 
covered areas are convex, abrupt changes in the value function appear only at 
the contact point between rooms and therefore filters. 

However, since we do no know the optimal SFA responses of the two room 
environment, we decided to stick to the known artifical responses as described 
in section [5321 The convergence quality with respect to the number of optimal 
responses is shown in figure 15.201 Compared to the original task depicted in 
figure I5TT1 one needs much more responses (at least 100) to reach the working 
regime. Also the convergence quality is much lower, translating in the best cases 
to ca. 4 unnecessary steps. 

To sum it up, the task has proven to be much more complicated. Besides, 
the SFA task is probably harder too, leading to a more distorted state represen- 
tation. However, we performed a series of test with preprocessed video images 
from the simulator, using the first 10 to 500 filters. In terms of convergence 
quality, no learned control differed much from the base line in figure IST201 
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5.3.5 Discussion 

We successfully conducted the main experiment. Assuming optimal responses, 
as few as 20 filters are sufficient to learn a near optimal control. Using too 
much filters, however, produces catastrophic overfitting. Switching to SFA pre- 
processed video images yields, at least in principle, the same result. However, 
we need much more filters (at least 80) and the resulting control needs consid- 
erably more steps to reach the goal area. Albeit not as well validated as the 
simulator experiment, our method has proven to work on a real robot as well. 

However, both successful experiments based on video images exhibited er- 
ratic behaviour at some points. The robot first gets caught or is "undecided" 
between left and right rotations but then escapes without obvious reason. Close 
observation reveals the effect to be present only (but not always) when actions 
change (in opposition to continuous driving or rotation). At these positions two 
or more actions must have similar Q-values and noise in the representation can 
induce erratic behaviour. This kind of behaviour has not been observed using 
artificial state representations, which supports the conclusion. 

5.4 Conclusion 

We wanted to give a proof of concept that LSPI can learn a efficient control 
based only on the video images preprocessed by filters learned by SFA. Given the 
results of the last section, we can say we have. However, we also demonstrated 
that with the current computational capacity we can only solve simple problems, 
yet. 

From our analysis we can identify the most likely sources of error: 

• Sensor errors: Real video images are not constant over time, so the SFA 
output will always be afflicted by noise. While noise afflicted transitions 
were always part of the policy iteration framework, up to now noise af- 
flicted state representations received little attention. The effects of this 
noise appear in form of the discussed "uncertain" behaviour. 

• Preprocessing errors: SFA output does not resemble trigonometric 
polynoms perfectly. This can induce local anomalies and therefore local 
value estimation errors. For example, the simulated experiment requires 
much more SFA filters even though sensory noise does not play a role. 

• Approximation errors: Abrupt changes in the value function (e.g. at 
walls) can lead to local approximation errors, because the function class 
at hand can not represent the slope. For example, in the artificial two- 
room experiment one needs much more optimal responses, due to the tight 
corridor. 

However, the most dangerous source of error is over- and underfitting. While 
undcrfitting is simply a question of computational power, overfitting could be 
lessened by rcgularization and more training samples. 



Chapter 6 

Conclusion 



6.1 Summary 

Applying reinforcement learning methods onto real-world scenarios bears one 
overwhelming obstacle: The state of the world. The most common mathemat- 
ical formalism of Markov decision processes (MDP) requires at all times the 
complete state in a suitable representation. Even if one assumes real- world sen- 
sor readings to hold the complete state, its complex and noise afflicted structure 
is hardly suitable for any learning algorithm. 

The most common approach to this problem are hand-crafted heuristics 
to filter state information out of raw sensor data. However, since unforeseen 
situations and interactions will occur, this approach is bound to fail outside the 
controlled conditions of a laboratory. Biological systems, on the other hand, 
demonstrate the ability to learn efficient representations of their environment. 
These representations do not only represent the environment well, but can also 
adapt when conditions change. Thus, faced with real- world scenarios, one should 
rather learn a representation of sensor data than simply define & heuristic one. 

In this thesis we have explored the case of robot navigation. As a handicap, 
the only accessible state information were the images of a head mounted video 
camera. We chose a well known reinforcement learning algorithm, least squares 
policy iteration (LSPI), to learn a navigational control. The only reinforcement 
signal were a reward in the goal area and a punishment near walls. 

LSPI is based on the least squares Q-value algorithm (LSQ), which fits a 
linear function to approximate the Q-value of all state-action pairs. After learn- 
ing, the control will choose the action with the highest Q-value in the current 
state. Given the state in form of the current video image, a linear function is 
not powerful enough to approximate the Q-value function well. 

From approximation theory we know of several sets of basis functions, that 
are particularly suited to approximate continuous functions by linear weighting. 
To allow LSPI to learn the task, we therefore must aim for a set of filters that 
extract the state in the form of suitable basis functions. This thesis is focused on 
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the unsupervised method of slow feature analysis (SFA), which produces filters 
that are in the limit equivalent to trigonometric polynomials. 

Approach 

Assume a static environment in which a wheeled robot can navigate by moving 
forward and rotating. The only non-static variable is the robots position and 
orientation. Since all possible images depend exclusively on these variables they 
form the complete state. In this environment, we record a random walk in which 
orientation and position changes slowly. 

The main principle of SFA, temporal coherence, aims to construct filters that 
minimize the change in the filtered training set. To avoid trivial solutions, the 
filters output is bound to change by constraining it to unit variance. Another 
constraint, decorrelation, ensures a set of orthonormal filters. 

Since these constraints force the filtered output to change, the slowest change 
will follow the state of the random walk. The specific optimization problem 
employed by SFA leads to trigonometric polynomials as optimal solution, i.e. 
assuming infinite number of training samples and unrestricted model class. 

The main experiment in this thesis consists of an initial random walk video 
on which we learn a set of filters with SFA. After applying the learned filters 
onto the initial video, the output is used as input of LSPI. We obtain a linear 
weighting of filter outputs for every action, which approximates the Q-value. 
Given a current video image, the control will select the action with the highest 
Q-value, i.e. that promises the most future reward. In our case that means 
to navigate as fast as possible into the goal area, since it is the only available 
reward, while avoiding walls. 

Results 

We examined two environments. The first was a simple rectangular laboratory 
area, which was evaluated with a real robot and as a simulation. The sec- 
ond consisted of two quadratic rooms connected by a small corridor. Here we 
performed only simulated experiments. 

In the simulation of the rectangular environment, our method produced 
controls that differed only around 2 to 3 steps per trajectory from the optimal 
choices. The real robot also succeeded, but reached the goal only in 80% of the 
test trajectories and took significantly longer than expected. 

The more complicated two-room environment did not yield a working con- 
trol. However, tests under optimal conditions show that the task is significantly 
more complex than the former one. It is the authors believe that with more 
computational resources in both phases, SFA and LSPI, a successful control can 
be learned. 
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6.2 Discussion 

After presenting the methodology and results in the last section, we want to 
discuss some problems the author noticed during his work on this thesis. The 
discussion will hrst review general flaws, followed by a close look into detailed 
problems and some suggestions how to overcome them. 

We will pick up some of the topics in the next section to make concrete 
suggestions on possible future advancements. 

Slow feature analysis 

Despite the extension to kernel SFA, slow feature analysis still suffers perfor- 
mance issues, as the failure of the two-room experiment shows. However, even 
if all practical problems can be solved, one has to ask if a task independent 
preprocessing can handle real scenarios at all. 

Without any insight in the task at hand, the extracted basis functions have 
to cover all state space dimensions and all combinations of them. Given a 
certain approximation quality in every dimension (e.g. polynomials up to some 
degree), the number of required filters grows exponentially in the dimensionality 
of the extracted state space. For example, a trigonometric representation of a 
p dimensional state space with degrees (di, . . . ,d p ) and : di « dj grows 
exponentially in p. Due to mixtures in all but the first few filters, we can not 
exclude unwanted dimensions or combinations, even if we would know them. 

Even more, we can not be certain that the world can be modelled by a finite 
dimensional state space. A complex system, e.g. a laughing human or a broad- 
leafed tree in the breeze, would require an inconceivable high dimensionality. 
Even with a model class that can represent the whole space, the number of 
required training samples would go against infinity. 

Thus, extracting the complete state space is only feasible in very controlled, 
unique cases. In the end we would like to extract only the subspace that is useful 
for the general "class" of problems at hand. For example, if every considered 
task in some room is independent of the horizontal position, a trigonometric 
representation of vertical position and orientation would be sufficient for any 
navigational control. In this context the state space must not represent some 
underlying "real" causes (e.g. of muscle movement) but more problem depen- 
dent "mcta" causes (e.g. smiling). Section 16.31 will present a more concrete 
proposal how to achieve this. 

• With the exception of the simple case of rectangular rooms, there are 
no theoretical predictions for optimal responses. Given the variability 
of environmental states, this can be seen as a major disadvantage. For 
example, imagine a robot arm with 5 joints. Ideally, one would expect 
a 5-dimensional combination of free- or cyclic-boundary conditions, but 
as in the two room case, the arm movement will be constraint by the 
environment. Therefore the optimal responses will be unpredictable in all 
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but the most simple cases. However, it is not clear if these unexpected 
responses will facilitate or limit the filters approximation capability. 

• This thesis always assumed the current state of the environment to be 
encoded in the current video image. Of course, this is not always the case. 
In line with the simultaneous localization and mapping (SLAM) method, 
one could include short term memory to track the current position in 
such cases. However, this would require either a complete reformulation 
of the optimization problem or a sophisticated post-processing of the SEA 
output. A much simpler heuristic solution is presented in section [6.31 

Policy iteration 

In the tabular representation, i.e. with small set of discrete states, policy iter- 
ation is guaranteed to converge monotonically to an optimal policy. This guar- 
antee, however, does not extend to the realistic case of approximated Q-value 
functions. 

The successive application of a policy estimation and greedy policy improve- 
ment step reminds of the expectation maximization algorithm (EM, e.g. [9]). In 
a discrete version (e.g. K-means, e.g. [5]), this popular algorithm is prone to 
become stuck at local minima. This effect resembles the results of policy itera- 
tion in the unstable regime (see sec. 15.3.2]) . In difference to policy iteration, for 
EM a number of probabilistic extensions exist (e.g. soft K-means, e.g. [5]), that 
greatly reduce (albeit not eliminate) this problem. 

However, as the modified initial policy in section 15.3.21 demonstrated, it is 
possible to use probabilistic choices instead of a pure greedy policy. In section 
IG.3I we will discuss a consideration of error sources (discussed below) into the 
policy improvement step. When these problems are discussed in literature, the 
only considered errors are due to a weak model class. In practice, however, two 
other sources of error have come to light: 

• Every real preprocessing will exhibit some anomalies. The resulting local 
approximation errors can be amplified by the policy iterations and thus 
ruin an otherwise good policy. 

• Since real sensors are afflicted by noise, the control will exhibit erratic 
behaviour in areas in which Q-values of the best actions are close by. In 
the worst case, oscillating behaviour can occur, e.g. turning back and 
forth. 

Application 

At last we have to review the general applicability of the presented algorithms: 

• Restricting oneself to discrete actions greatly reduces the area of appli- 
cation. The only way to circumvent this, discretizing continuous actions, 
increases the number of actions immensely. Because the Q-value estimator 
LSQ scales cubic in the number of actions, this quickly becomes infcasiblc. 
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• Constructing a suitable reinforcement learning scenario is not as intuitive 
as is might look like. In particular the choice of hyper parameters (7, num- 
ber of filters and amount of reward/punishment) influences the resulting 
policy considerably. 

• The presented method looks promising, but sticks out by its high com- 
putational demand. Due to its biological motivation, its computational 
elegance does not transfer to von Neumann architectures of most modern 
computers. From this background it looks natural to use parallel process- 
ing architectures, e.g. graphic cards, programmable logic arrays or vector 
processors. Indeed, with these one could consider far more complex models 
in reasonable time. 

However, both SFA and LSPI algorithms are in the presented form not 
parallelizable. Besides many matrix multiplications, which can be paral- 
lelized well in a shared memory architecture, the key elements are matrix 
inversions or eigenvalue decompositions. There exist parallel versions of 
both problems, but they are not as efficient and may introduce additional 
side effects. 

In the end, however, the human brain performs these tasks in a massive 
parallel processed fashion. This alone makes it a promising approach. 

6.3 Outlook 

As discussed above, application of reinforcement learning to real- world scenarios 
still bears many problems. The author wants to give here only a small selection 
of potential improvements, that followed directly from his work on this thesis. 

Exclude unwanted states from SFA 

To apply the proposed method to non-static environments, one has to exclude 
unnecessary state subspaces from the SFA solution. Due to mixtures of the 
optimal responses, one can not directly exclude single filters. Instead, the opti- 
mization problem has to include a term that penalizes dependence on unwanted 
dimensions. It is important that the resulting filters do not specialize too much 
in the task at hand. For example, in a navigation task a set of filters that 
support only policies that drive to the center of the room would not work with 
another goal. 

In line with canonical correlation analysis and partial least squares [8], one 
possible approach would maximize the correlation of the filters to a target value. 
As a complication, we do not know a target value directly, but only the experi- 
enced reward. 
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Short term memory in preprocessing 

A thorough consideration of holding a current state would require a complete 
reformulation of the SFA optimization problem. However, an easy considera- 
tion of the predecessor state (e.g. temporal smoothing xt = r)Xt-i + (1 — rf)4>t) 
could stabilize the current state estimation. Since the state Xt-i depends on 
all previous observations 4>q, . . . , <fit-i, the same position can be represented by 
different states x t , depending on the past. This will induce additional "noise" 
in the representation, so the parameter rj should be chosen very small. 

Consider local errors in policy iteration 

In section [5.3.2l wc introduced a simple modification to the initial random policy 
of LSPI. Instead of sampling the second action randomly, we considered all 
actions: <fi'(x) = tjt J2aeA ( l ) ( x ' a )- ^ n the following iterations, however, we 
chose this action greedy. If we review the sources of error from the previous 
section, we see that these errors are not distributed equally over the state space: 

• Weak function classes (e.g. not enough filters) have problems to approx- 
imate the slope of abrupt changes in the reward. However, most areas of 
the value function are smooth enough. 

• In areas with similar Q-values noise affected controls will show erratic 
behaviour. Where Q-values differ enough, the control will be stable. 

• SFA filters tend to differ from the optimal responses locally, when the 
considered function class can not map an image properly. 

The first step would be to determine these local errors. With errors and Q- 
values at hand, one can calculate a realistic probability p(-) for an action to be 
chosen by a greedy policy. Instead of just choosing one action greedy, one could 
learn all possible actions at once: 4>'(x) — J2 aeA p(a) 0(x,a). 

After the experience with the new initial policy, one can hope that the re- 
sulting policy iteration will be more stable and therefore converge better. 

Policy dependent norm 

LSPI is based on the squared Li norm, which means it aims to distribute the 
approximation error equally on all training samples. However, as we have seen 
in our previous discussion, areas with nearby Q-values need much better ap- 
proximation than those far-off. It would improve the reliability of the control if 
one would use a policy dependent norm, i.e. put more weight on samples with 
similar Q- values. 
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